Evaluation of the RAG Pipeline
Introduction
In this notebook, we’ll pick up where we left off by turning our attention to the patent side of things and then benchmarking the RAG pipeline with our generated Q&A pairs. First, we’ll explore the Cleantech Google Patent dataset in detail to understand its structure and peculiarities. Next, we’ll prepare and index embeddings for both our media and patent corpora, so that our RAG system has a solid knowledge base to retrieve from. Finally, we’ll use the reference implementation provided by Prof. Dr. Daniel Perruchoud to run end-to-end evaluations over:
The original “seed” evaluation set,
Our generated Q&A pairs.
We’ll break down performance by question category and by retrieval relevance, so that we can identify strengths, weaknesses, and opportunities for improvement in the RAG system. Let’s get started!
Setup
To run this notebook we recommend downloading the provided GitHub repository and opening this notebook in Google Colab. To ensure a smooth experience, you'll need:
- An OpenAI API key for GPT-4o
- A Google account for Google Colab
- Python packages (automatically installed within the notebook)
At the start of the notebook, data.zip is downloaded from Google Drive and unzipped. It contains checkpoints for all of the expensive processing steps, such as chunking, generating embeddings, and evaluating the pipeline with an LLM as a judge. Loading these checkpoints instead of recomputing them saves you money and a lot of time.
If you can't or don't want to run this notebook you can also view the completed notebook by opening the cleantech_rag.html file in your browser.
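The checkpoint idea described above can be illustrated with a minimal sketch. It is not the notebook's actual loading code: the helper name `load_or_compute`, the JSON format, and the `data/chunks.json` path are assumptions made for illustration only.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_or_compute(checkpoint: Path, compute: Callable[[], list]) -> list:
    """Return the cached result if the checkpoint exists, else compute and save it."""
    if checkpoint.exists():
        return json.loads(checkpoint.read_text(encoding="utf-8"))
    result = compute()
    checkpoint.parent.mkdir(parents=True, exist_ok=True)
    checkpoint.write_text(json.dumps(result), encoding="utf-8")
    return result


calls = []  # records how often the expensive step actually runs


def expensive_chunking() -> list:
    # Hypothetical stand-in for chunking, embedding, or LLM-as-judge scoring.
    calls.append(1)
    return ["chunk one", "chunk two"]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data" / "chunks.json"  # hypothetical checkpoint location
    first = load_or_compute(path, expensive_chunking)   # computes and writes the file
    second = load_or_compute(path, expensive_chunking)  # reads the file, no recompute

print(first == second, len(calls))
```

With a shipped checkpoint already on disk, the first call would also skip the computation, which is where the cost savings come from.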
Setting your OpenAI Key
The OpenAI API key is used for the following tasks:
- Generating embeddings for our semantic search.
- Leveraging GPT-4-turbo as our LLM in our RAG pipeline and as our judge in RAGAS.
%%writefile .env
OPENAI_API_KEY=''
Writing .env
After executing the above cell, you should restart the kernel/runtime to ensure the key is properly set.
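After the restart, it can help to confirm that the key is actually visible to Python. The following is a minimal, standard-library-only sketch of reading a `.env` file; the helper `load_env_file` and the demo variable `DEMO_API_KEY` are hypothetical, and the notebook itself may load the file differently.

```python
import os
import tempfile
from pathlib import Path


def load_env_file(path: Path) -> dict:
    """Parse KEY=VALUE lines (skipping blanks and comments) and export them to os.environ."""
    loaded = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        value = value.strip()
        # Drop one pair of matching surrounding quotes, e.g. 'abc' -> abc.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        loaded[key.strip()] = value
    os.environ.update(loaded)
    return loaded


# Demo on a temporary file with a placeholder value, so no real key is touched.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text("# demo file\nDEMO_API_KEY='placeholder-value'\n", encoding="utf-8")
    env = load_env_file(env_path)

key_is_set = bool(os.environ.get("DEMO_API_KEY"))
print("DEMO_API_KEY set:", key_is_set)
```

In the notebook you would point the helper at `Path(".env")` and check `OPENAI_API_KEY` instead. An empty value, as in the template above, evaluates to not set.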
Installing Dependencies
%%writefile requirements.txt
chromadb==0.5.0
datasets==2.19.1
gdown==5.2.0
kaggle==1.6.1
langchain==0.3.10
langchain-community==0.3.10
langchain-experimental==0.3.3
langchain-openai==0.2.12
langdetect==1.0.9
lorem-text==2.1
nbformat>=4.2.0
openai==1.57.1
plotly==5.22.0
pretty-jupyter==1.0
ragas==0.1.8
seaborn==0.13.2
sentence-transformers==3.0.0
spacy>=3.7
textstat==0.7.3
umap-learn==0.5.5
Writing requirements.txt
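Once the packages are installed, the pins in requirements.txt can be compared with what is actually present in the environment. The sketch below uses only `importlib.metadata` from the standard library; the helper names are hypothetical, and the comparison is deliberately simple: `==` pins are compared as plain strings, while `>=` pins are only reported when the package is missing.

```python
import re
from importlib import metadata

# Matches lines such as "chromadb==0.5.0" or "spacy>=3.7".
PIN = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*(==|>=)\s*([^\s#]+)")


def parse_pins(text: str) -> dict:
    """Map package name -> (operator, version) for every pinned line."""
    pins = {}
    for line in text.splitlines():
        match = PIN.match(line)
        if match:
            pins[match.group(1).lower()] = (match.group(2), match.group(3))
    return pins


def installed_versions(names) -> dict:
    """Look up the installed version of each package, or None if it is absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def mismatches(pins: dict, installed: dict) -> dict:
    """Report missing packages and exact pins whose installed version differs."""
    report = {}
    for name, (op, wanted) in pins.items():
        have = installed.get(name)
        if have is None or (op == "==" and have != wanted):
            report[name] = (f"{op}{wanted}", have)
    return report


# A three-line excerpt of the requirements file written above.
sample = "chromadb==0.5.0\nragas==0.1.8\nspacy>=3.7\n"
pins = parse_pins(sample)
print(mismatches(pins, installed_versions(pins)))
```

To check the full file, pass `open("requirements.txt").read()` to `parse_pins`. An empty dictionary means every exact pin matches.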
Before we dive into embeddings and RAG evaluation, we need to ensure our Python environment has the exact versions of PyTorch, fsspec, and related dependencies that are compatible with the rest of our tooling and GPU runtime. To achieve this, run the commands below:
%pip install torch==2.5.1 --quiet --index-url https://download.pytorch.org/whl/cu121
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.6.0+cu124 requires torch==2.6.0, but you have torch 2.5.1+cu121 which is incompatible.
torchvision 0.21.0+cu124 requires torch==2.6.0, but you have torch 2.5.1+cu121 which is incompatible.
%pip install -r ./requirements.txt --quiet
Preparing metadata (setup.py) ... done
Preparing metadata (setup.py) ... done
Preparing metadata (setup.py) ... done
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Building wheel for kaggle (setup.py) ... done
Building wheel for langdetect (setup.py) ... done
Building wheel for umap-learn (setup.py) ... done
Building wheel for pypika (pyproject.toml) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.6.0+cu124 requires torch==2.6.0, but you have torch 2.5.1+cu121 which is incompatible.
gcsfs 2025.3.2 requires fsspec==2025.3.2, but you have fsspec 2024.3.1 which is incompatible.
torchvision 0.21.0+cu124 requires torch==2.6.0, but you have torch 2.5.1+cu121 which is incompatible.
!pip install --upgrade fsspec==2025.3.0
Collecting fsspec==2025.3.0
Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading fsspec-2025.3.0-py3-none-any.whl (193 kB)
Installing collected packages: fsspec
Attempting uninstall: fsspec
Found existing installation: fsspec 2024.3.1
Uninstalling fsspec-2024.3.1:
Successfully uninstalled fsspec-2024.3.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 2.19.1 requires fsspec[http]<=2024.3.1,>=2023.1.0, but you have fsspec 2025.3.0 which is incompatible.
torchaudio 2.6.0+cu124 requires torch==2.6.0, but you have torch 2.5.1+cu121 which is incompatible.
gcsfs 2025.3.2 requires fsspec==2025.3.2, but you have fsspec 2025.3.0 which is incompatible.
torchvision 0.21.0+cu124 requires torch==2.6.0, but you have torch 2.5.1+cu121 which is incompatible.
Successfully installed fsspec-2025.3.0
!pip check
datasets 2.19.1 has requirement fsspec[http]<=2024.3.1,>=2023.1.0, but you have fsspec 2025.3.0.
torchaudio 2.6.0+cu124 has requirement torch==2.6.0, but you have torch 2.5.1+cu121.
gcsfs 2025.3.2 has requirement fsspec==2025.3.2, but you have fsspec 2025.3.0.
torchvision 0.21.0+cu124 has requirement torch==2.6.0, but you have torch 2.5.1+cu121.
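The `pip check` report above is easier to act on when grouped by the package that causes each conflict. The following sketch parses the four reported lines with a regular expression; the pattern and the helper `parse_pip_check` are assumptions based solely on the message format shown here, not on any documented pip interface.

```python
import re

# Pattern for one conflict sentence as printed by `pip check` above.
CONFLICT = re.compile(
    r"(?P<package>\S+) (?P<version>\S+) has requirement (?P<requirement>.+?), "
    r"but you have (?P<installed>\S+ \S+?)\.(?=\s|$)"
)

# The four conflicts reported in the output above.
PIP_CHECK_OUTPUT = (
    "datasets 2.19.1 has requirement fsspec[http]<=2024.3.1,>=2023.1.0, "
    "but you have fsspec 2025.3.0.\n"
    "torchaudio 2.6.0+cu124 has requirement torch==2.6.0, "
    "but you have torch 2.5.1+cu121.\n"
    "gcsfs 2025.3.2 has requirement fsspec==2025.3.2, "
    "but you have fsspec 2025.3.0.\n"
    "torchvision 0.21.0+cu124 has requirement torch==2.6.0, "
    "but you have torch 2.5.1+cu121.\n"
)


def parse_pip_check(text: str) -> list:
    """Return one dict per conflict with package, version, requirement, installed."""
    return [match.groupdict() for match in CONFLICT.finditer(text)]


conflicts = parse_pip_check(PIP_CHECK_OUTPUT)

# Group the complaining packages by the installed distribution they object to.
blocked_by = {}
for conflict in conflicts:
    blocked_by.setdefault(conflict["installed"], []).append(conflict["package"])

for installed, packages in blocked_by.items():
    print(f"{installed}: conflicts with {', '.join(packages)}")
```

Grouped this way, the report reduces to two root causes: the installed fsspec version and the installed torch version.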
To ensure that all newly installed packages are properly loaded and any old versions are cleared from memory, please restart your notebook kernel after this cell. Once the kernel is back up, you’ll be ready to proceed with installing PyTorch and importing the needed libraries.
!pip install --upgrade --force-reinstall gcsfs fsspec
Collecting gcsfs
Downloading gcsfs-2025.3.2-py2.py3-none-any.whl.metadata (1.9 kB)
Collecting fsspec
Downloading fsspec-2025.3.2-py3-none-any.whl.metadata (11 kB)
Collecting aiohttp!=4.0.0a0,!=4.0.0a1 (from gcsfs)
Downloading aiohttp-3.11.18-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)
Collecting decorator>4.1.2 (from gcsfs)
Downloading decorator-5.2.1-py3-none-any.whl.metadata (3.9 kB)
Collecting google-auth>=1.2 (from gcsfs)
Downloading google_auth-2.39.0-py2.py3-none-any.whl.metadata (6.2 kB)
Collecting google-auth-oauthlib (from gcsfs)
Downloading google_auth_oauthlib-1.2.2-py3-none-any.whl.metadata (2.7 kB)
Collecting google-cloud-storage (from gcsfs)
Downloading google_cloud_storage-3.1.0-py2.py3-none-any.whl.metadata (12 kB)
Collecting requests (from gcsfs)
Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting aiohappyeyeballs>=2.3.0 (from aiohttp!=4.0.0a0,!=4.0.0a1->gcsfs)
Downloading aiohappyeyeballs-2.6.1-py3-none-any.whl.metadata (5.9 kB)
Collecting aiosignal>=1.1.2 (from aiohttp!=4.0.0a0,!=4.0.0a1->gcsfs)
Downloading aiosignal-1.3.2-py2.py3-none-any.whl.metadata (3.8 kB)
Collecting attrs>=17.3.0 (from aiohttp!=4.0.0a0,!=4.0.0a1->gcsfs)
Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB)
Collecting frozenlist>=1.1.1 (from aiohttp!=4.0.0a0,!=4.0.0a1->gcsfs)
Downloading frozenlist-1.6.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (16 kB)
Collecting multidict<7.0,>=4.5 (from aiohttp!=4.0.0a0,!=4.0.0a1->gcsfs)
Downloading multidict-6.4.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.3 kB)
Collecting propcache>=0.2.0 (from aiohttp!=4.0.0a0,!=4.0.0a1->gcsfs)
Downloading propcache-0.3.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Collecting yarl<2.0,>=1.17.0 (from aiohttp!=4.0.0a0,!=4.0.0a1->gcsfs)
Downloading yarl-1.20.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (72 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72.4/72.4 kB 2.9 MB/s eta 0:00:00
Collecting cachetools<6.0,>=2.0.0 (from google-auth>=1.2->gcsfs)
Downloading cachetools-5.5.2-py3-none-any.whl.metadata (5.4 kB)
Collecting pyasn1-modules>=0.2.1 (from google-auth>=1.2->gcsfs)
Downloading pyasn1_modules-0.4.2-py3-none-any.whl.metadata (3.5 kB)
Collecting rsa<5,>=3.1.4 (from google-auth>=1.2->gcsfs)
Downloading rsa-4.9.1-py3-none-any.whl.metadata (5.6 kB)
Collecting requests-oauthlib>=0.7.0 (from google-auth-oauthlib->gcsfs)
Downloading requests_oauthlib-2.0.0-py2.py3-none-any.whl.metadata (11 kB)
Collecting google-api-core<3.0.0dev,>=2.15.0 (from google-cloud-storage->gcsfs)
Downloading google_api_core-2.24.2-py3-none-any.whl.metadata (3.0 kB)
Collecting google-cloud-core<3.0dev,>=2.4.2 (from google-cloud-storage->gcsfs)
Downloading google_cloud_core-2.4.3-py2.py3-none-any.whl.metadata (2.7 kB)
Collecting google-resumable-media>=2.7.2 (from google-cloud-storage->gcsfs)
Downloading google_resumable_media-2.7.2-py2.py3-none-any.whl.metadata (2.2 kB)
Collecting google-crc32c<2.0dev,>=1.0 (from google-cloud-storage->gcsfs)
Downloading google_crc32c-1.7.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.3 kB)
Collecting charset-normalizer<4,>=2 (from requests->gcsfs)
Downloading charset_normalizer-3.4.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (35 kB)
Collecting idna<4,>=2.5 (from requests->gcsfs)
Downloading idna-3.10-py3-none-any.whl.metadata (10 kB)
Collecting urllib3<3,>=1.21.1 (from requests->gcsfs)
Downloading urllib3-2.4.0-py3-none-any.whl.metadata (6.5 kB)
Collecting certifi>=2017.4.17 (from requests->gcsfs)
Downloading certifi-2025.4.26-py3-none-any.whl.metadata (2.5 kB)
Collecting googleapis-common-protos<2.0.0,>=1.56.2 (from google-api-core<3.0.0dev,>=2.15.0->google-cloud-storage->gcsfs)
Downloading googleapis_common_protos-1.70.0-py3-none-any.whl.metadata (9.3 kB)
Collecting protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<7.0.0,>=3.19.5 (from google-api-core<3.0.0dev,>=2.15.0->google-cloud-storage->gcsfs)
Downloading protobuf-6.30.2-cp39-abi3-manylinux2014_x86_64.whl.metadata (593 bytes)
Collecting proto-plus<2.0.0,>=1.22.3 (from google-api-core<3.0.0dev,>=2.15.0->google-cloud-storage->gcsfs)
Downloading proto_plus-1.26.1-py3-none-any.whl.metadata (2.2 kB)
Collecting pyasn1<0.7.0,>=0.6.1 (from pyasn1-modules>=0.2.1->google-auth>=1.2->gcsfs)
Downloading pyasn1-0.6.1-py3-none-any.whl.metadata (8.4 kB)
Collecting oauthlib>=3.0.0 (from requests-oauthlib>=0.7.0->google-auth-oauthlib->gcsfs)
Downloading oauthlib-3.2.2-py3-none-any.whl.metadata (7.5 kB)
Downloading gcsfs-2025.3.2-py2.py3-none-any.whl (36 kB)
Downloading fsspec-2025.3.2-py3-none-any.whl (194 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 194.4/194.4 kB 7.8 MB/s eta 0:00:00
Downloading aiohttp-3.11.18-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 39.2 MB/s eta 0:00:00
Downloading decorator-5.2.1-py3-none-any.whl (9.2 kB)
Downloading google_auth-2.39.0-py2.py3-none-any.whl (212 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 212.3/212.3 kB 21.9 MB/s eta 0:00:00
Downloading google_auth_oauthlib-1.2.2-py3-none-any.whl (19 kB)
Downloading google_cloud_storage-3.1.0-py2.py3-none-any.whl (174 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 174.9/174.9 kB 15.5 MB/s eta 0:00:00
Downloading requests-2.32.3-py3-none-any.whl (64 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64.9/64.9 kB 6.5 MB/s eta 0:00:00
Downloading aiohappyeyeballs-2.6.1-py3-none-any.whl (15 kB)
Downloading aiosignal-1.3.2-py2.py3-none-any.whl (7.6 kB)
Downloading attrs-25.3.0-py3-none-any.whl (63 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.8/63.8 kB 5.3 MB/s eta 0:00:00
Downloading cachetools-5.5.2-py3-none-any.whl (10 kB)
Downloading certifi-2025.4.26-py3-none-any.whl (159 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 159.6/159.6 kB 15.8 MB/s eta 0:00:00
Downloading charset_normalizer-3.4.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (143 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 143.9/143.9 kB 15.4 MB/s eta 0:00:00
Downloading frozenlist-1.6.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (313 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 313.6/313.6 kB 30.7 MB/s eta 0:00:00
Downloading google_api_core-2.24.2-py3-none-any.whl (160 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 160.1/160.1 kB 16.7 MB/s eta 0:00:00
Downloading google_cloud_core-2.4.3-py2.py3-none-any.whl (29 kB)
Downloading google_crc32c-1.7.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (32 kB)
Downloading google_resumable_media-2.7.2-py2.py3-none-any.whl (81 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81.3/81.3 kB 8.0 MB/s eta 0:00:00
Downloading idna-3.10-py3-none-any.whl (70 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70.4/70.4 kB 6.9 MB/s eta 0:00:00
Downloading multidict-6.4.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (223 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 223.5/223.5 kB 22.9 MB/s eta 0:00:00
Downloading propcache-0.3.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (232 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 232.5/232.5 kB 22.3 MB/s eta 0:00:00
Downloading pyasn1_modules-0.4.2-py3-none-any.whl (181 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 181.3/181.3 kB 17.5 MB/s eta 0:00:00
Downloading requests_oauthlib-2.0.0-py2.py3-none-any.whl (24 kB)
Downloading rsa-4.9.1-py3-none-any.whl (34 kB)
Downloading urllib3-2.4.0-py3-none-any.whl (128 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 128.7/128.7 kB 17.4 MB/s eta 0:00:00
Downloading yarl-1.20.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (358 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 358.1/358.1 kB 34.7 MB/s eta 0:00:00
Downloading googleapis_common_protos-1.70.0-py3-none-any.whl (294 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 294.5/294.5 kB 28.4 MB/s eta 0:00:00
Downloading oauthlib-3.2.2-py3-none-any.whl (151 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 151.7/151.7 kB 15.7 MB/s eta 0:00:00
Downloading proto_plus-1.26.1-py3-none-any.whl (50 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50.2/50.2 kB 4.7 MB/s eta 0:00:00
Downloading protobuf-6.30.2-cp39-abi3-manylinux2014_x86_64.whl (316 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 316.2/316.2 kB 29.8 MB/s eta 0:00:00
Downloading pyasn1-0.6.1-py3-none-any.whl (83 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83.1/83.1 kB 8.2 MB/s eta 0:00:00
Installing collected packages: urllib3, pyasn1, protobuf, propcache, oauthlib, multidict, idna, google-crc32c, fsspec, frozenlist, decorator, charset-normalizer, certifi, cachetools, attrs, aiohappyeyeballs, yarl, rsa, requests, pyasn1-modules, proto-plus, googleapis-common-protos, google-resumable-media, aiosignal, requests-oauthlib, google-auth, aiohttp, google-auth-oauthlib, google-api-core, google-cloud-core, google-cloud-storage, gcsfs
Attempting uninstall: urllib3
Found existing installation: urllib3 2.4.0
Uninstalling urllib3-2.4.0:
Successfully uninstalled urllib3-2.4.0
Attempting uninstall: pyasn1
Found existing installation: pyasn1 0.6.1
Uninstalling pyasn1-0.6.1:
Successfully uninstalled pyasn1-0.6.1
Attempting uninstall: protobuf
Found existing installation: protobuf 5.29.4
Uninstalling protobuf-5.29.4:
Successfully uninstalled protobuf-5.29.4
Attempting uninstall: propcache
Found existing installation: propcache 0.3.1
Uninstalling propcache-0.3.1:
Successfully uninstalled propcache-0.3.1
Attempting uninstall: oauthlib
Found existing installation: oauthlib 3.2.2
Uninstalling oauthlib-3.2.2:
Successfully uninstalled oauthlib-3.2.2
Attempting uninstall: multidict
Found existing installation: multidict 6.4.3
Uninstalling multidict-6.4.3:
Successfully uninstalled multidict-6.4.3
Attempting uninstall: idna
Found existing installation: idna 3.10
Uninstalling idna-3.10:
Successfully uninstalled idna-3.10
Attempting uninstall: google-crc32c
Found existing installation: google-crc32c 1.7.1
Uninstalling google-crc32c-1.7.1:
Successfully uninstalled google-crc32c-1.7.1
Attempting uninstall: fsspec
Found existing installation: fsspec 2025.3.0
Uninstalling fsspec-2025.3.0:
Successfully uninstalled fsspec-2025.3.0
Attempting uninstall: frozenlist
Found existing installation: frozenlist 1.6.0
Uninstalling frozenlist-1.6.0:
Successfully uninstalled frozenlist-1.6.0
Attempting uninstall: decorator
Found existing installation: decorator 4.4.2
Uninstalling decorator-4.4.2:
Successfully uninstalled decorator-4.4.2
Attempting uninstall: charset-normalizer
Found existing installation: charset-normalizer 3.4.1
Uninstalling charset-normalizer-3.4.1:
Successfully uninstalled charset-normalizer-3.4.1
Attempting uninstall: certifi
Found existing installation: certifi 2025.1.31
Uninstalling certifi-2025.1.31:
Successfully uninstalled certifi-2025.1.31
Attempting uninstall: cachetools
Found existing installation: cachetools 5.5.2
Uninstalling cachetools-5.5.2:
Successfully uninstalled cachetools-5.5.2
Attempting uninstall: attrs
Found existing installation: attrs 25.3.0
Uninstalling attrs-25.3.0:
Successfully uninstalled attrs-25.3.0
Attempting uninstall: aiohappyeyeballs
Found existing installation: aiohappyeyeballs 2.6.1
Uninstalling aiohappyeyeballs-2.6.1:
Successfully uninstalled aiohappyeyeballs-2.6.1
Attempting uninstall: yarl
Found existing installation: yarl 1.20.0
Uninstalling yarl-1.20.0:
Successfully uninstalled yarl-1.20.0
Attempting uninstall: rsa
Found existing installation: rsa 4.9.1
Uninstalling rsa-4.9.1:
Successfully uninstalled rsa-4.9.1
Attempting uninstall: requests
Found existing installation: requests 2.32.3
Uninstalling requests-2.32.3:
Successfully uninstalled requests-2.32.3
Attempting uninstall: pyasn1-modules
Found existing installation: pyasn1_modules 0.4.2
Uninstalling pyasn1_modules-0.4.2:
Successfully uninstalled pyasn1_modules-0.4.2
Attempting uninstall: proto-plus
Found existing installation: proto-plus 1.26.1
Uninstalling proto-plus-1.26.1:
Successfully uninstalled proto-plus-1.26.1
Attempting uninstall: googleapis-common-protos
Found existing installation: googleapis-common-protos 1.70.0
Uninstalling googleapis-common-protos-1.70.0:
Successfully uninstalled googleapis-common-protos-1.70.0
Attempting uninstall: google-resumable-media
Found existing installation: google-resumable-media 2.7.2
Uninstalling google-resumable-media-2.7.2:
Successfully uninstalled google-resumable-media-2.7.2
Attempting uninstall: aiosignal
Found existing installation: aiosignal 1.3.2
Uninstalling aiosignal-1.3.2:
Successfully uninstalled aiosignal-1.3.2
Attempting uninstall: requests-oauthlib
Found existing installation: requests-oauthlib 2.0.0
Uninstalling requests-oauthlib-2.0.0:
Successfully uninstalled requests-oauthlib-2.0.0
Attempting uninstall: google-auth
Found existing installation: google-auth 2.38.0
Uninstalling google-auth-2.38.0:
Successfully uninstalled google-auth-2.38.0
Attempting uninstall: aiohttp
Found existing installation: aiohttp 3.11.15
Uninstalling aiohttp-3.11.15:
Successfully uninstalled aiohttp-3.11.15
Attempting uninstall: google-auth-oauthlib
Found existing installation: google-auth-oauthlib 1.2.2
Uninstalling google-auth-oauthlib-1.2.2:
Successfully uninstalled google-auth-oauthlib-1.2.2
Attempting uninstall: google-api-core
Found existing installation: google-api-core 2.24.2
Uninstalling google-api-core-2.24.2:
Successfully uninstalled google-api-core-2.24.2
Attempting uninstall: google-cloud-core
Found existing installation: google-cloud-core 2.4.3
Uninstalling google-cloud-core-2.4.3:
Successfully uninstalled google-cloud-core-2.4.3
Attempting uninstall: google-cloud-storage
Found existing installation: google-cloud-storage 2.19.0
Uninstalling google-cloud-storage-2.19.0:
Successfully uninstalled google-cloud-storage-2.19.0
Attempting uninstall: gcsfs
Found existing installation: gcsfs 2025.3.2
Uninstalling gcsfs-2025.3.2:
Successfully uninstalled gcsfs-2025.3.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 2.19.1 requires fsspec[http]<=2024.3.1,>=2023.1.0, but you have fsspec 2025.3.2 which is incompatible.
opentelemetry-proto 1.32.1 requires protobuf<6.0,>=5.0, but you have protobuf 6.30.2 which is incompatible.
google-colab 1.0.0 requires google-auth==2.38.0, but you have google-auth 2.39.0 which is incompatible.
ydf 0.11.0 requires protobuf<6.0.0,>=5.29.1, but you have protobuf 6.30.2 which is incompatible.
tensorflow-metadata 1.17.1 requires protobuf<6.0.0,>=4.25.2; python_version >= "3.11", but you have protobuf 6.30.2 which is incompatible.
torchaudio 2.6.0+cu124 requires torch==2.6.0, but you have torch 2.5.1+cu121 which is incompatible.
moviepy 1.0.3 requires decorator<5.0,>=4.0.2, but you have decorator 5.2.1 which is incompatible.
google-cloud-aiplatform 1.90.0 requires google-cloud-storage<3.0.0,>=1.32.0, but you have google-cloud-storage 3.1.0 which is incompatible.
google-ai-generativelanguage 0.6.15 requires protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<6.0.0dev,>=3.20.2, but you have protobuf 6.30.2 which is incompatible.
torchvision 0.21.0+cu124 requires torch==2.6.0, but you have torch 2.5.1+cu121 which is incompatible.
tensorflow 2.18.0 requires protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<6.0.0dev,>=3.20.3, but you have protobuf 6.30.2 which is incompatible.
grpcio-status 1.71.0 requires protobuf<6.0dev,>=5.26.1, but you have protobuf 6.30.2 which is incompatible.
Successfully installed aiohappyeyeballs-2.6.1 aiohttp-3.11.18 aiosignal-1.3.2 attrs-25.3.0 cachetools-5.5.2 certifi-2025.4.26 charset-normalizer-3.4.1 decorator-5.2.1 frozenlist-1.6.0 fsspec-2025.3.2 gcsfs-2025.3.2 google-api-core-2.24.2 google-auth-2.39.0 google-auth-oauthlib-1.2.2 google-cloud-core-2.4.3 google-cloud-storage-3.1.0 google-crc32c-1.7.1 google-resumable-media-2.7.2 googleapis-common-protos-1.70.0 idna-3.10 multidict-6.4.3 oauthlib-3.2.2 propcache-0.3.1 proto-plus-1.26.1 protobuf-6.30.2 pyasn1-0.6.1 pyasn1-modules-0.4.2 requests-2.32.3 requests-oauthlib-2.0.0 rsa-4.9.1 urllib3-2.4.0 yarl-1.20.0
!pip uninstall torch torchvision -y
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Found existing installation: torch 2.5.1+cu121
Uninstalling torch-2.5.1+cu121:
Successfully uninstalled torch-2.5.1+cu121
Found existing installation: torchvision 0.21.0+cu124
Uninstalling torchvision-0.21.0+cu124:
Successfully uninstalled torchvision-0.21.0+cu124
Looking in indexes: https://download.pytorch.org/whl/cu118
Collecting torch
Downloading https://download.pytorch.org/whl/cu118/torch-2.7.0%2Bcu118-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (28 kB)
Collecting torchvision
Downloading https://download.pytorch.org/whl/cu118/torchvision-0.22.0%2Bcu118-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (6.1 kB)
Requirement already satisfied: torchaudio in /usr/local/lib/python3.11/dist-packages (2.6.0+cu124)
Requirement already satisfied: filelock in /usr/local/lib/python3.11/dist-packages (from torch) (3.18.0)
Requirement already satisfied: typing-extensions>=4.10.0 in /usr/local/lib/python3.11/dist-packages (from torch) (4.13.2)
Collecting sympy>=1.13.3 (from torch)
Downloading https://download.pytorch.org/whl/sympy-1.13.3-py3-none-any.whl.metadata (12 kB)
Requirement already satisfied: networkx in /usr/local/lib/python3.11/dist-packages (from torch) (3.4.2)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.11/dist-packages (from torch) (3.1.6)
Requirement already satisfied: fsspec in /usr/local/lib/python3.11/dist-packages (from torch) (2025.3.2)
Collecting nvidia-cuda-nvrtc-cu11==11.8.89 (from torch)
Downloading https://download.pytorch.org/whl/cu118/nvidia_cuda_nvrtc_cu11-11.8.89-py3-none-manylinux1_x86_64.whl (23.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.2/23.2 MB 108.1 MB/s eta 0:00:00
Collecting nvidia-cuda-runtime-cu11==11.8.89 (from torch)
Downloading https://download.pytorch.org/whl/cu118/nvidia_cuda_runtime_cu11-11.8.89-py3-none-manylinux1_x86_64.whl (875 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 875.6/875.6 kB 55.6 MB/s eta 0:00:00
Collecting nvidia-cuda-cupti-cu11==11.8.87 (from torch)
Downloading https://download.pytorch.org/whl/cu118/nvidia_cuda_cupti_cu11-11.8.87-py3-none-manylinux1_x86_64.whl (13.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.1/13.1 MB 123.1 MB/s eta 0:00:00
Collecting nvidia-cudnn-cu11==9.1.0.70 (from torch)
Downloading https://download.pytorch.org/whl/cu118/nvidia_cudnn_cu11-9.1.0.70-py3-none-manylinux2014_x86_64.whl (663.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 663.9/663.9 MB 2.2 MB/s eta 0:00:00
Collecting nvidia-cublas-cu11==11.11.3.6 (from torch)
Downloading https://download.pytorch.org/whl/cu118/nvidia_cublas_cu11-11.11.3.6-py3-none-manylinux1_x86_64.whl (417.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 417.9/417.9 MB 3.1 MB/s eta 0:00:00
Collecting nvidia-cufft-cu11==10.9.0.58 (from torch)
Downloading https://download.pytorch.org/whl/cu118/nvidia_cufft_cu11-10.9.0.58-py3-none-manylinux1_x86_64.whl (168.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 168.4/168.4 MB 9.6 MB/s eta 0:00:00
Collecting nvidia-curand-cu11==10.3.0.86 (from torch)
Downloading https://download.pytorch.org/whl/cu118/nvidia_curand_cu11-10.3.0.86-py3-none-manylinux1_x86_64.whl (58.1 MB)
Note: if importing fails because of UMAP, run the cell again or re-import UMAP on its own.
import json
import os
import warnings
import zipfile
import gdown
from collections import Counter
from pathlib import Path
from typing import Dict, List
import chromadb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import seaborn as sns
import torch
import umap
pd.set_option('display.max_colwidth', 100)
from chromadb import Collection, Documents, EmbeddingFunction, Embeddings
from datasets import Dataset
from dotenv import load_dotenv
from langdetect import detect
from lorem_text import lorem
from ragas import RunConfig, evaluate
from ragas.metrics import (faithfulness, answer_relevancy, context_relevancy, answer_correctness)
from spacy.lang.en import English
from textstat import flesch_reading_ease
from tqdm import tqdm
from langchain.chains.base import Chain
from langchain.text_splitter import RecursiveCharacterTextSplitter, TextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma, VectorStore
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.embeddings import Embeddings
from langchain_core.language_models import LLM
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.retrievers import BaseRetriever
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
nlp = English()
tokenizer = nlp.tokenizer
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
load_dotenv()
warnings.filterwarnings("ignore")
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data] Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
The following cells download the dataset, the chunks, the chunk embeddings, and the evaluation results from our Google Drive. This saves you time and money.
data_folder = Path("./data")
if not data_folder.exists():
    data_folder.mkdir()
file_id = "1VvlAUqXyUeSC35rMcmNoJgaFxIhjoHG4"
url = f"https://drive.google.com/uc?id={file_id}"
gdown.download(url, output="data.zip", quiet=True)
with zipfile.ZipFile("data.zip", "r") as zip_file:
    zip_file.extractall(path=data_folder)
# %%script echo skipping
# !gdown 1VmFQFogA4S7iy5iX2-Tsl3xa11zNnUFR
# %%script echo skipping
# with zipfile.ZipFile("data.zip", "r") as zip_file:
# zip_file.extractall()
Setting up our LLM¶
To make sure our OpenAI key works, we test it by generating a response from GPT-4o, the model we will also use later in our RAG pipeline. Try different prompts or questions to see how the model responds.
llm = ChatOpenAI(model="gpt-4o")
question_prompt = ChatPromptTemplate.from_template(
    "Answer the following question: {question}")
question_chain = question_prompt | llm | StrOutputParser()
question_chain.invoke({"question": "What is the meaning of life?"})
"The meaning of life is a deeply philosophical question that has been contemplated by thinkers, theologians, and individuals throughout history. Different cultures, religions, and philosophies offer various interpretations, including:\n\n1. **Religious Perspectives**: Many religions suggest that the meaning of life is to fulfill a divine purpose, seek enlightenment, or achieve a connection with a higher power.\n\n2. **Philosophical Views**: Philosophers have proposed various theories, such as existentialism, which emphasizes individual freedom and choice, and utilitarianism, which focuses on the greatest happiness for the greatest number.\n\n3. **Personal Interpretations**: For many, the meaning of life is subjective and can be found in personal fulfillment, relationships, love, creativity, and the pursuit of knowledge.\n\nUltimately, the meaning of life may vary for each person, and it can be shaped by experiences, beliefs, and values. It's a question that invites exploration and reflection, encouraging individuals to find their own answers."
Downloading the Dataset from Kaggle¶
We’ll use the kagglehub helper to load both our media and patent datasets directly into pandas DataFrames.
import kagglehub
from kagglehub import KaggleDatasetAdapter
file_path_media_dataset_v3_2024_10_28 = "cleantech_media_dataset_v3_2024-10-28.csv"
articles_df = kagglehub.load_dataset(
    KaggleDatasetAdapter.PANDAS,
    "jannalipenkova/cleantech-media-dataset",
    file_path_media_dataset_v3_2024_10_28,
)
file_path_media_dataset_v3_2024_09_20 = "cleantech_rag_evaluation_data_2024-09-20.csv"
human_eval = kagglehub.load_dataset(
    KaggleDatasetAdapter.PANDAS,
    "jannalipenkova/cleantech-media-dataset",
    file_path_media_dataset_v3_2024_09_20,
    pandas_kwargs={
        "sep": ";",
        "engine": "python",
        "encoding": "latin-1",
    }
)
file_paths = [
'bq-results-20240124-055833-1706076079048.json',
'CleanTech_22-24.json',
'CleanTech_22-24_updated.json'
]
dfs = {}
for file_path in file_paths:
    # Load each patent file by its own name; all are read as JSON Lines
    df = kagglehub.load_dataset(
        KaggleDatasetAdapter.PANDAS,
        "prakharbhandari20/cleantech-google-patent-dataset",
        file_path,
        pandas_kwargs={
            "lines": True,
        })
    dfs[file_path] = df
patent_df = dfs['CleanTech_22-24_updated.json']
bronze_folder = data_folder / "bronze"
if not bronze_folder.exists():
    bronze_folder.mkdir()
Loading the Dataset into Dataframes¶
We now load and inspect both the Cleantech Media Dataset and the gold-standard evaluation data provided by our subject matter expert, Janna Lipenkova.
articles_df = pd.read_csv(
    bronze_folder / "cleantech_media_dataset_v3_2024-10-28.csv",
    encoding='utf-8', index_col=0, on_bad_lines='skip')
articles_df.head()
| | title | date | author | content | domain | url |
|---|---|---|---|---|---|---|
| 93320 | XPeng Delivered ~100,000 Vehicles In 2021 | 2022-01-02 | NaN | ['Chinese automotive startup XPeng has shown one of the most dramatic auto production ramp-ups i... | cleantechnica | https://cleantechnica.com/2022/01/02/xpeng-delivered-100000-vehicles-in-2021/ |
| 93321 | Green Hydrogen: Drop In Bucket Or Big Splash? | 2022-01-02 | NaN | ['Sinopec has laid plans to build the largest green hydrogen production facility in the world, b... | cleantechnica | https://cleantechnica.com/2022/01/02/its-a-green-hydrogen-drop-in-the-bucket-but-it-could-still-... |
| 98159 | World’ s largest floating PV plant goes online in China – pv magazine International | 2022-01-03 | NaN | ['Huaneng Power International has switched on a 320 MW floating PV array in China’ s Shandong pr... | pv-magazine | https://www.pv-magazine.com/2022/01/03/worlds-largest-floating-pv-plant-goes-online-in-china/ |
| 98158 | Iran wants to deploy 10 GW of renewables over the next four years – pv magazine International | 2022-01-03 | NaN | ['According to the Iranian authorities, there are currently more than 80GW of renewable energy p... | pv-magazine | https://www.pv-magazine.com/2022/01/03/iran-wants-to-deploy-10-gw-of-renewables-over-the-next-fo... |
| 31128 | Eastern Interconnection Power Grid Said ‘ Being Challenged in New Ways’ | 2022-01-03 | NaN | ['Sign in to get the best natural gas news and data. Follow the topics you want and receive the ... | naturalgasintel | https://www.naturalgasintel.com/eastern-interconnection-power-grid-said-being-challenged-in-new-... |
patents_df = pd.read_json(bronze_folder /"CleanTech_22-24_updated.json", lines=True)
patents_df.head(5)
| | publication_number | application_number | country_code | title | abstract | publication_date | inventor | cpc_code |
|---|---|---|---|---|---|---|---|---|
| 0 | CN-117138249-A | CN-202311356270-A | CN | 一种石墨烯光疗面罩 | The application provides a graphene phototherapy mask, and relates to the technical field of pho... | 20231201 | [LI HAITAO, CAO WENQIANG] | A61N2005/0654 |
| 1 | CN-117151396-A | CN-202311109834-A | CN | Distributed economic scheduling method for wind, solar, biogas and hydrogen multi-energy multi-m... | The invention discloses a distributed economic dispatching method of a wind, solar and methane h... | 20231201 | [HU PENGFEI, LI ZIMENG] | G06Q50/06 |
| 2 | CN-117141530-A | CN-202310980795-A | CN | 氢能源动力轨道车辆组 | The invention discloses a hydrogen energy power rail vehicle group, which comprises a power vehi... | 20231201 | [XIE BO, ZHANG SHUIQING, ZHOU FEI, LIU YONG, Zhou Houyi] | Y02T90/40 |
| 3 | CN-117141244-A | CN-202311177651-A | CN | 一种汽车太阳能充电系统、方法及新能源汽车 | The application discloses an automobile solar charging system, an automobile solar charging meth... | 20231201 | [ZHAO PENGCHENG] | B60K16/00 |
| 4 | CN-117146094-A | CN-202311272549-A | CN | 一种水利水电管道连接装置 | The invention provides a water conservancy and hydropower pipeline connecting device, which effe... | 20231201 | [LYU SHUOSHUO, LI PANFENG, XU ZHENGWEI, WANG WEIBIN, ZHANG CHEN, ZHOU HAIYUN] | F16L55/02 |
# Drop rows that differ only in `cpc_code` or `inventor`: two rows count as duplicates when all six columns below match
print(f"Number of rows before: {len(patents_df)}")
patents_df_unique = patents_df.drop_duplicates(subset=["publication_number","application_number","country_code", "title", "abstract", "publication_date"])
print(f"Number of rows after: {len(patents_df_unique)}")
Number of rows before: 406857
Number of rows after: 68125
publication_number_counts = patents_df_unique.groupby('publication_number').size().sort_values(ascending=False)
print(publication_number_counts)
publication_number
WO-2024027062-A1 10
WO-2024131301-A1 10
WO-2022183304-A1 10
EP-4052298-A1 8
WO-2022265333-A1 7
..
US-11828147-B2 1
US-11828138-B2 1
US-11824484-B2 1
US-11824363-B2 1
ZA-202309063-B 1
Length: 31366, dtype: int64
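The counts above show that some publication numbers still occur up to ten times after deduplication. A toy example (hypothetical rows, not taken from the patent dataset) sketches why this can happen: `drop_duplicates` with a `subset` only collapses rows that agree on every listed column, so rows sharing a `publication_number` but differing in, say, the title text all survive.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the patent dataset.
toy = pd.DataFrame({
    "publication_number": ["X-1", "X-1", "X-1", "Y-2"],
    "title": ["Solar cell", "Solar cell", "Cellule solaire", "Wind turbine"],
    "cpc_code": ["H01L", "Y02E", "H01L", "F03D"],
})

# Rows are duplicates only if they match on every column in `subset`.
by_number_and_title = toy.drop_duplicates(subset=["publication_number", "title"])

# Restricting the subset to the identifier keeps exactly one row per patent.
by_number_only = toy.drop_duplicates(subset=["publication_number"])

print(len(by_number_and_title))  # 3
print(len(by_number_only))       # 2
```

Whether one row per `publication_number` is actually desirable depends on the downstream use; keeping several language variants of a title may be intentional.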
# patents_df_unique.to_csv(bronze_folder / "cleantech_patents_with_lang.csv", index=False)
output_path = bronze_folder / "cleantech_patents_with_lang.csv"
if output_path.exists():
    patents_df_unique = pd.read_csv(output_path)
else:
    print("dataset not found. Running language detection...")
    # Detect language of title and abstract
    patents_df_unique["title_lang"] = patents_df_unique["title"].apply(detect)
    patents_df_unique["abstract_lang"] = patents_df_unique["abstract"].apply(detect)
    # Save full dataset
    patents_df_unique.to_csv(output_path, index=False)
    print(f"Saved full dataset with language detection: {output_path.name}")
# Filter only English rows
patents_df_unique_en = patents_df_unique[
    (patents_df_unique["title_lang"] == "en") &
    (patents_df_unique["abstract_lang"] == "en")
]
patents_df_unique_en.head(5)
| | publication_number | application_number | country_code | title | abstract | publication_date | inventor | cpc_code | title_lang | abstract_lang |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | CN-117151396-A | CN-202311109834-A | CN | Distributed economic scheduling method for wind, solar, biogas and hydrogen multi-energy multi-m... | The invention discloses a distributed economic dispatching method of a wind, solar and methane h... | 20231201 | ['HU PENGFEI', 'LI ZIMENG'] | G06Q50/06 | en | en |
| 5 | CN-117147382-A | CN-202310985511-A | CN | Device for monitoring hydrogen atom crossing grain boundary diffusion by using SKPFM and testing... | The invention provides a device and a method for monitoring hydrogen atom crossing grain boundar... | 20231201 | ['MA ZHAOXIANG', 'WANG CHENGXU', 'LIU ZHONGLI'] | G01N13/00 | en | en |
| 6 | CN-113344288-B | CN-202110717505-A | CN | Cascade hydropower station group water level prediction method and device and computer readable ... | The invention discloses a cascade hydropower station group water level prediction method, a casc... | 20231201 | [] | G06Q10/04 | en | en |
| 8 | CN-117153944-A | CN-202311209193-A | CN | Heterojunction solar cell, preparation method thereof and photovoltaic module | The application provides a heterojunction solar cell, a preparation method thereof and a photovo... | 20231201 | ['TONG HONGBO', 'JIN YUPENG'] | H01L31/074 | en | en |
| 9 | CN-116911695-B | CN-202311167289-A | CN | Flexible resource adequacy evaluation method and device | The invention relates to a flexible resource adequacy evaluation method and device of an electri... | 20231201 | [] | H02J2203/20 | en | en |
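Applying a language detector row by row can fail on inputs that contain no usable text (for example empty strings or missing values), and a single failure aborts the whole `apply`. The sketch below shows one way to guard against that. `safe_detect` and `toy_detector` are hypothetical names introduced here for illustration; the toy detector merely stands in for `langdetect.detect` so the example runs without extra dependencies.

```python
def safe_detect(text, detector, fallback="unknown"):
    """Return detector(text), or `fallback` for missing, empty, or undetectable text."""
    if not isinstance(text, str) or not text.strip():
        return fallback
    try:
        return detector(text)
    except Exception:
        # Detection libraries may raise on text without recognisable features.
        return fallback


def toy_detector(text):
    # Hypothetical stand-in for a real detector: raises on text without letters.
    if not any(ch.isalpha() for ch in text):
        raise ValueError("no features in text")
    return "en" if text.isascii() else "other"


samples = ["Heterojunction solar cell", "氢能源动力轨道车辆组", "", None, "12345"]
results = [safe_detect(s, toy_detector) for s in samples]
print(results)  # ['en', 'other', 'unknown', 'unknown', 'unknown']
```

In the notebook, this could be used as `patents_df_unique["title"].apply(lambda t: safe_detect(t, detect))`. As far as we know, langdetect results can also vary between runs unless `DetectorFactory.seed` is set beforehand, which matters when the saved CSV is meant to be reproducible.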
human_eval_df = pd.read_csv(
    bronze_folder / "cleantech_rag_evaluation_data_2024-09-20.csv",
    encoding='utf-8',
    # index_col=0,
    sep=';')
human_eval_df.head()
| | example_id | question_id | question | relevant_text | answer | article_url |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | What is the innovation behind Leclanché's new method to produce lithium-ion batteries? | Leclanché said it has developed an environmentally friendly way to produce lithium-ion (Li-ion) ... | Leclanché's innovation is using a water-based process instead of highly toxic organic solvents t... | https://www.sgvoice.net/strategy/technology/23971/leclanches-new-disruptive-battery-boosts-energ... |
| 1 | 2 | 2 | What is the EU’s Green Deal Industrial Plan? | The Green Deal Industrial Plan is a bid by the EU to make its net zero industry more competitive... | The EU’s Green Deal Industrial Plan aims to enhance the competitiveness of its net zero industry... | https://www.sgvoice.net/policy/25396/eu-seeks-competitive-boost-with-green-deal-industrial-plan/ |
| 2 | 3 | 2 | What is the EU’s Green Deal Industrial Plan? | The European counterpart to the US Inflation Reduction Act (IRA) aims to create an environment t... | The EU’s Green Deal Industrial Plan aims to enhance the competitiveness of its net zero industry... | https://www.pv-magazine.com/2023/02/02/european-commission-introduces-green-deal-industrial-plan/ |
| 3 | 4 | 3 | What are the four focus areas of the EU's Green Deal Industrial Plan? | The new plan is fundamentally focused on four areas, or pillars: the regulatory environment, acc... | The four focus areas of the EU's Green Deal Industrial Plan are the regulatory environment, acce... | https://www.sgvoice.net/policy/25396/eu-seeks-competitive-boost-with-green-deal-industrial-plan/ |
| 4 | 5 | 4 | When did the cooperation between GM and Honda on fuel cell vehicles start? | What caught our eye was a new hookup between GM and Honda. Honda was also hammering away at the ... | July 2013 | https://cleantechnica.com/2023/05/08/general-motors-seizes-the-fuel-cell-moment-with-green-hydro... |
human_eval_df_media = pd.read_csv(
    bronze_folder / "generated_questions_media_qa_llama.csv",
    encoding='utf-8',
    # index_col=0,
    # sep=';'
)
human_eval_df_media.head()
| | id | context | question | answer | source_doc | category | groundedness_score | groundedness_eval | relevance_score | relevance_eval | standalone_score | standalone_eval |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What is green hydrogen, and how is it produced? | Green hydrogen is a sustainable energy carrier produced by water electrolysis using renewable en... | https://www.azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 5.0 | The provided context clearly explains what green hydrogen is, how it is produced through water e... | 3.0 | The question is straightforward and clear, inquiring about a specific concept (green hydrogen) a... | 5.0 | The question is self-explanatory and does not rely on additional context to be understood. It is... |
| 1 | 1 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What is the significance of the Neom project in Saudi Arabia as a pioneering example of green hy... | The Neom project, a partnership between ACWA Power, Air Products, and NEOM, harnesses solar and ... | https://www.azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 5.0 | The context provides a comprehensive overview of the significance of green hydrogen and its inte... | 4.0 | The Neom project is a key concept in the context of green hydrogen integration, and understandin... | 4.0 | The question appears to rely on additional knowledge about the Neom project and its connection t... |
| 2 | 2 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What is the significance of the Energy Transitions Commission's report on making clean electrifi... | The report highlights the need for a 30-year transition to electrify the global economy, providi... | https://www.azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 5.0 | The question regarding the significance of the Energy Transitions Commission's report "Making Cl... | 3.0 | This question appears to be relevant to environmental sustainability and energy policy, which mi... | 5.0 | The question refers to a specific institution (Energy Transitions Commission) and a specific con... |
| 3 | 3 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | How does green hydrogen compare to direct use of electricity in terms of energy efficiency? | Green hydrogen production through electrolysis is less energy-efficient than direct use of elect... | https://www.azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 4.0 | The question requires an in-depth analysis of the context provided, specifically focusing on the... | 4.0 | This question is relevant to NLP developers building applications that may use energy-intensive ... | 5.0 | The question implies that there might be some universal or general information about green hydro... |
| 4 | 4 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What are some of the examples of pilot projects testing the viability of green hydrogen in vario... | Various pilot projects are testing green hydrogen's viability in energy systems, demonstrating i... | https://www.azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 5.0 | The question can be answered unambiguously based on the provided context, which describes variou... | 3.0 | This question appears to be focused on environmental sustainability and energy systems, which is... | 4.0 | The question asks for specific examples of pilot projects, which implies the existence of a cont... |
human_eval_df_patent = pd.read_csv(
    bronze_folder / "generated_questions_patent_qa_llama.csv",
    encoding='utf-8',
    # index_col=0,
    # sep=';'
)
human_eval_df_patent.head()
| | id | context | question | answer | category | groundedness_score | groundedness_eval | relevance_score | relevance_eval | standalone_score | standalone_eval | title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Distributed photovoltaic energy storage refrigeration house systemThe utility model discloses a ... | How does the system reduce the cost of cold storage? | The system reduces the cost of cold storage by converting solar energy into electric energy, whi... | Sustainability & Technological Innovation Questions | 5.0 | The question can be answered based on the given context, and the answer is clear and unambiguous. | 4.0 | The question is concise and to the point, directly asking about a specific aspect of how the Hug... | 5.0 | The question appears to be related to a general concept of cost reduction in the context of data... | Distributed photovoltaic energy storage refrigeration house system |
| 1 | 1 | Water path manifold structure of hydrogen energy automobile electric pileThe utility model discl... | What is good about the utility model? | The utility model has a simple structure that is easy to assemble and disassemble. | Analytical & Explanatory Questions | 5.0 | The question "What is good about the utility model?" is somewhat ambiguous without further clari... | 3.0 | The question is very short and to the point, but it lacks context and detail about the specific ... | 5.0 | The question seems to be asking about a general property or characteristic of a "utility model",... | Water path manifold structure of hydrogen energy automobile electric pile |
| 2 | 2 | Active power control method of water-fire-wind-solar energy storage multi-energy complementary i... | What is the purpose of using power supply with better regulation performance to compensate for p... | The power supply with better regulation performance is used to carry out compensation regulation... | Government & Corporate Initiatives | 5.0 | The question is clearly answerable by understanding the purpose of using power supply with bette... | 3.0 | The question is directly related to power supply regulation and its impact on the performance of... | 5.0 | The question assumes knowledge of power supplies in general, specifically their regulation perfo... | Active power control method of water-fire-wind-solar energy storage multi-energy complementary i... |
| 3 | 3 | Water conservancy and hydropower engineering construction tunnel internal flow guiding and drain... | What is the water conservancy and hydropower engineering construction hole inner diversion drain... | The utility model discloses a water conservancy and hydropower engineering construction hole inn... | Sustainability & Technological Innovation Questions | 5.0 | The question is clearly answerable with the given context, as it describes a specific inner dive... | 3.0 | The question seems to be about a specific technical term, which may be useful for machine learni... | 5.0 | The question contains technical terms and a specific reference to a concept that appears to be w... | Water conservancy and hydropower engineering construction tunnel internal flow guiding and drain... |
| 4 | 4 | Medium-and-long-term electric power quantity balancing method for electric power system containi... | What is the main consideration for the balancing method, in addition to safety and economy? | The seasonal characteristics of renewable energy sources in time and the coordination problem of... | Analytical & Explanatory Questions | 4.0 | The context provides a detailed description of a method for balancing electric quantity in a pow... | 4.0 | The question is asking about a specific aspect of the balancing method, which is a common techni... | 5.0 | The question does not provide a specific context, and the balancing method is a general concept ... | Medium-and-long-term electric power quantity balancing method for electric power system containi... |
Exploratory Data Analysis & Preprocessing¶
As the saying goes, "garbage in, garbage out." In the realm of machine learning, the quality of our outputs hinges on the quality of our inputs. This section delves into the essential processes of Exploratory Data Analysis (EDA) and data preprocessing. Through EDA, we'll illuminate the characteristics, patterns, and potential quirks residing within our cleantech news article dataset. Preprocessing will ensure our data is cleansed, structured, and prepared to be effectively utilized by the RAG pipeline, laying the foundation for high-quality results.
Let us start by gaining an overview of the dataset's features (columns).
Media Dataset¶
articles_df.describe()
| | author |
|---|---|
| count | 0.0 |
| mean | NaN |
| std | NaN |
| min | NaN |
| 25% | NaN |
| 50% | NaN |
| 75% | NaN |
| max | NaN |
articles_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 20111 entries, 93320 to 101431
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   title    20111 non-null  object
 1   date     20111 non-null  object
 2   author   0 non-null      float64
 3   content  20111 non-null  object
 4   domain   20111 non-null  object
 5   url      20111 non-null  object
dtypes: float64(1), object(5)
memory usage: 1.6+ MB
Our initial exploration reveals that the "author" column contains no data at all: 0 of the 20111 entries are non-null. Since it offers no information, we can remove this feature.
We've also observed that some titles and content entries appear to be non-unique. This might necessitate identifying and removing duplicate entries.
On a positive note, the article URLs are all unique, potentially serving as suitable unique identifiers for the data.
articles_df = articles_df.drop(columns=["author"])
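Checks like the ones above (how sparsely a column is filled, and whether a column can serve as a unique identifier) can be summarised in one small table. The following is a minimal sketch on a hypothetical toy frame; the column names mirror the media dataset, but the values are made up.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only.
toy = pd.DataFrame({
    "title": ["A", "A", "B"],
    "author": [None, None, None],
    "url": ["https://example.org/1", "https://example.org/2", "https://example.org/3"],
})

non_null = toy.notna().sum()   # non-null entries per column
n_unique = toy.nunique()       # distinct non-null values per column

summary = pd.DataFrame({"non_null": non_null, "unique": n_unique})
# A column is a candidate key if every row has its own distinct value.
summary["is_key"] = summary["unique"] == len(toy)
print(summary)
```

Running the same summary on `articles_df` would show at a glance which columns are empty and which are unique per row.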
Article Domains¶
The dataset helpfully provides the domain names extracted from the article URLs. These domains essentially represent the publishers of the news articles. Let's analyze the distribution of publishers and see how many articles each publisher has contributed.
domain_counts = articles_df["domain"].value_counts()
domain_counts
| domain | count |
|---|---|
| energy-xprt | 4181 |
| pv-magazine | 3093 |
| azocleantech | 2488 |
| cleantechnica | 2089 |
| pv-tech | 1969 |
| thinkgeoenergy | 1052 |
| solarpowerportal.co | 850 |
| energyvoice | 828 |
| solarpowerworldonline | 785 |
| solarindustrymag | 621 |
| solarquarter | 606 |
| rechargenews | 573 |
| naturalgasintel | 298 |
| iea | 173 |
| energyintel | 171 |
| greenprophet | 130 |
| greenairnews | 59 |
| ecofriend | 55 |
| all-energy | 39 |
| decarbxpo | 20 |
| storagesummit | 15 |
| eurosolar | 9 |
| indorenergy | 4 |
| bex-asia | 2 |
| biofuels-news | 1 |
A visualization helps us to understand the skew in the data.
barplot = sns.barplot(
    x=domain_counts.values,
    y=domain_counts.index,
    hue=domain_counts.index
)
barplot.set_title('Article Counts by Domain')
barplot.set_xlabel('Article Count')
barplot.set_ylabel('Domain')
plt.show()
Our exploration of article domains reveals a skewed distribution. Publishers like energy-xprt have a significantly higher representation (4181 articles), while others like biofuels-news contribute almost nothing (1 article). If we proceed with sampling this data, this imbalance should be taken into account. Stratified sampling by domain is the recommended approach to keep the sample representative across publishers.
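Stratified sampling by publisher can be sketched with pandas' group-wise `sample`. The frame below is a hypothetical toy with made-up domain names and counts; only the technique carries over to `articles_df`.

```python
import pandas as pd

# Hypothetical toy data with a deliberately skewed domain distribution.
toy = pd.DataFrame({
    "domain": ["big"] * 80 + ["mid"] * 15 + ["small"] * 5,
    "url": [f"https://example.org/{i}" for i in range(100)],
})

# Draw 20% from every domain independently, so each publisher
# keeps its share of the sample.
sample = toy.groupby("domain").sample(frac=0.2, random_state=42)

counts = sample["domain"].value_counts()
print(counts)
```

One caveat with a fixed fraction: for publishers with only a handful of articles the per-group sample size can round down to zero, so a minimum number of rows per domain may be worth enforcing separately.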
Article Dates¶
Each article within the dataset is accompanied by a publication date. Let's delve into the temporal range of these articles and investigate any noteworthy patterns in publication trends.
# plot the amount of articles over time
articles_df["date"] = pd.to_datetime(articles_df["date"])
time_df = articles_df.groupby("date").size().reset_index()
time_df.columns = ["date","count"]
time_df.describe()
| | date | count |
|---|---|---|
| count | 979 | 979.000000 |
| mean | 2023-05-26 05:11:49.703779328 | 20.542390 |
| min | 2022-01-02 00:00:00 | 1.000000 |
| 25% | 2022-09-19 12:00:00 | 8.000000 |
| 50% | 2023-05-26 00:00:00 | 17.000000 |
| 75% | 2024-01-30 12:00:00 | 23.000000 |
| max | 2024-10-24 00:00:00 | 1812.000000 |
| std | NaN | 76.837351 |
sns.lineplot(data=time_df, x="date", y="count")
plt.title("Article Count Over Time")
plt.xlabel("Date")
plt.xticks(rotation=90)
plt.ylabel("Article Count")
# add a line for the average
avg_count = time_df["count"].mean()
plt.axhline(avg_count, color='r', linestyle='--', label=f"Average article count per day: {avg_count:.2f}")
plt.legend()
plt.show()
While the daily article count appears consistent overall, a significant outlier disrupts the pattern on 2023-12-05 (the daily maximum in the table above is 1812 articles, against a median of 17). The cause of this outlier is undetermined, but it could be the date the data was scraped, assigned as a default value to articles with missing dates. Since the publication date is not crucial for the RAG pipeline, we can remove it.
articles_df = articles_df.drop(columns=["date"])
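Spike days like the one discussed above can also be flagged programmatically rather than by eye. The sketch below applies the common interquartile-range (IQR) fence to a hypothetical series of daily counts; the values are made up and only loosely modeled on the summary statistics shown earlier.

```python
import pandas as pd

# Hypothetical daily article counts with one extreme spike.
counts = pd.Series(
    [17, 20, 8, 23, 1812, 19, 15],
    index=pd.date_range("2023-12-01", periods=7, freq="D"),
    name="count",
)

q1, q3 = counts.quantile(0.25), counts.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)   # standard IQR rule for high outliers

outliers = counts[counts > upper_fence]
print(outliers)
```

The IQR rule is only one possible choice; because it is based on quartiles, a single huge value does not inflate the threshold the way it would inflate a mean-and-standard-deviation rule.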
Article Titles¶
As noted in our initial exploration, some articles share identical titles. Here, we'll focus on identifying and handling these duplicate titles to ensure a clean and consistent dataset for our RAG pipeline.
sns.histplot(articles_df["title"].str.len())
plt.title("Title Length Distribution")
plt.xlabel("Title Length")
plt.ylabel("Count")
avg_count = articles_df["title"].str.len().mean()
plt.axvline(avg_count, color='r', linestyle='--', label=f"Average title length: {avg_count:.2f}")
plt.legend()
plt.show()
articles_df["title"].duplicated().sum()
95
duplicate_titles = articles_df[articles_df["title"].duplicated(keep=False)].sort_values("title")
duplicate_titles.head(10)
| | title | content | domain | url |
|---|---|---|---|---|
| 105526 | 'Three quarters of power from wind ' | ['Three quarters of Dutch power will come from wind by 2031, the chief of the Netherlands Wind E... | rechargenews | https://www.rechargenews.com/wind/three-quarters-of-power-from-wind-dutch-aim-for-new-heights-wi... |
| 105527 | 'Three quarters of power from wind ' | ['Three quarters of Dutch power will come from wind by 2031, the chief of the Netherlands Wind E... | rechargenews | https://www.rechargenews.com/wind/three-quarters-of-power-from-wind-dutch-aim-for-new-heights-wi... |
| 13367 | ACWA Power: Quotes, Address, Contact | ['We use cookies to enhance your experience. By continuing to browse this site you agree to our ... | azocleantech | https://www.azocleantech.com/suppliers.aspx?SupplierID=1869 |
| 22292 | ACWA Power: Quotes, Address, Contact | ["By clicking `` Allow All '' you agree to the storing of cookies on your device to enhance site... | azocleantech | https://www.azocleantech.com/Suppliers.aspx?SupplierID=1869 |
| 13371 | ADS-TEC Energy: Quotes, Address, Contact | ['We use cookies to enhance your experience. By continuing to browse this site you agree to our ... | azocleantech | https://www.azocleantech.com/suppliers.aspx?SupplierID=2172 |
| 22256 | ADS-TEC Energy: Quotes, Address, Contact | ["By clicking `` Allow All '' you agree to the storing of cookies on your device to enhance site... | azocleantech | https://www.azocleantech.com/Suppliers.aspx?SupplierID=2172 |
| 22388 | ALUULA Composites Inc.: Quotes, Address, Contact | ["By clicking `` Allow All '' you agree to the storing of cookies on your device to enhance site... | azocleantech | https://www.azocleantech.com/Suppliers.aspx?SupplierID=2176 |
| 13366 | ALUULA Composites Inc.: Quotes, Address, Contact | ['We use cookies to enhance your experience. By continuing to browse this site you agree to our ... | azocleantech | https://www.azocleantech.com/suppliers.aspx?SupplierID=2176 |
| 13375 | AMETEK STC: Quotes, Address, Contact | ['We use cookies to enhance your experience. By continuing to browse this site you agree to our ... | azocleantech | https://www.azocleantech.com/suppliers.aspx?SupplierID=1370 |
| 22396 | AMETEK STC: Quotes, Address, Contact | ["By clicking `` Allow All '' you agree to the storing of cookies on your device to enhance site... | azocleantech | https://www.azocleantech.com/Suppliers.aspx?SupplierID=1370 |
duplicate_titles["content"].duplicated().sum()
6
Our exploration identified 24 titles that appear multiple times in the dataset. Examples include "About David J. Cross." Interestingly, while the titles are identical, the content is mostly unique: only 6 of the rows with a repeated title also repeat the content of another such row.
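The different duplicate counts in this section measure slightly different things, which is easy to misread. As a small sketch on invented toy titles (not taken from the dataset): `duplicated().sum()` counts every repeat after the first occurrence, `duplicated(keep=False).sum()` counts all rows involved in a repetition, and the number of distinct titles that repeat is smaller than both.

```python
from collections import Counter

# Toy titles (hypothetical): "A" appears three times, "B" twice, "C" once
titles = ["A", "A", "A", "B", "B", "C"]
counts = Counter(titles)

# Distinct titles that occur more than once
repeated_titles = [title for title, n in counts.items() if n > 1]
# Equivalent of Series.duplicated().sum(): every occurrence after the first
surplus_rows = sum(n - 1 for n in counts.values())
# Equivalent of Series.duplicated(keep=False).sum(): all rows of repeated titles
rows_involved = sum(n for n in counts.values() if n > 1)

print(len(repeated_titles), surplus_rows, rows_involved)
```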
Here are some additional observations for further investigation:
- A pattern was found where some articles begin with the phrase "By clicking." We'll delve into this further to determine the potential impact when analyzing the article contents.
- We can observe instances of articles with seemingly similar content but differing URLs containing "sgvoice.energyvoice.com" and "energyvoice.com." Let's explore these cases to understand the potential distinction between them.
def wrap_text(text: str, char_per_line=100) -> str:
    # for better readability, wrap the text at the last space before the char_per_line
    if len(text) < char_per_line:
        return text
    else:
        # everything up to the last space within the first char_per_line characters
        head = text[:char_per_line].rsplit(' ', 1)[0]
        return head + '\n' + wrap_text(text[len(head) + 1:], char_per_line)
print(duplicate_titles.iloc[0]["title"])
print(wrap_text(duplicate_titles.iloc[0]["content"]))
'Three quarters of power from wind ' ['Three quarters of Dutch power will come from wind by 2031, the chief of the Netherlands Wind Energy Association predicted, shortly after the government opened bidding for each of the Ijmujiden Ver offshore wind sites, known respectively as Alpha 2GW and Beta 2GW.', 'Interested parties can hand in bids between up to March 28 for the twin far offshore zones in the North Sea, in what is the country’ s largest tender for wind at sea so far.', 'Recharge is part of DN Media Group. To read more about DN Media Group, click here', 'Recharge is part of DN Media Group AS. From November 1st DN Media Group is responsible for controlling your data on Recharge.', 'We use your data to ensure you have a secure and enjoyable user experience when visiting our site. You can read more about how we handle your information in our privacy policy.', 'DN Media Group is the leading news provider in the shipping, seafood, and energy industries, with a number of English- and Norwegian-language news publications across a variety of sectors. Read more about DN Media Group here.', 'Recharge is part of NHST Global Publications AS and we are responsible for the data that you register with us, and the data we collect when you visit our websites. We use cookies in a variety of ways to improve your experience, such as keeping NHST websites reliable and secure, personalising content and ads and to analyse how our sites are being used. For more information and how to manage your privacy settings, please refer to our privacy and cookie policies.']
print(duplicate_titles.iloc[1]["title"])
print(wrap_text(duplicate_titles.iloc[1]["content"]))
'Three quarters of power from wind ' ['Three quarters of Dutch power will come from wind by 2031, the chief of the Netherlands Wind Energy Association predicted, shortly after the government opened bidding for each of the Ijmujiden Ver offshore wind sites, known respectively as Alpha 2GW and Beta 2GW.', 'Interested parties can hand in bids between up to March 28 for the twin far offshore zones in the North Sea, in what is the country’ s largest tender for wind at sea so far.', 'Recharge is part of DN Media Group. To read more about DN Media Group, click here', 'Recharge is part of DN Media Group AS. From November 1st DN Media Group is responsible for controlling your data on Recharge.', 'We use your data to ensure you have a secure and enjoyable user experience when visiting our site. You can read more about how we handle your information in our privacy policy.', 'DN Media Group is the leading news provider in the shipping, seafood, and energy industries, with a number of English- and Norwegian-language news publications across a variety of sectors. Read more about DN Media Group here.', 'Recharge is part of NHST Global Publications AS and we are responsible for the data that you register with us, and the data we collect when you visit our websites. We use cookies in a variety of ways to improve your experience, such as keeping NHST websites reliable and secure, personalising content and ads and to analyse how our sites are being used. For more information and how to manage your privacy settings, please refer to our privacy and cookie policies.']
Our analysis suggests redundancy within certain articles. The two articles printed above appear to be identical, and in other cases the second article appears to be the first article with an additional sentence appended at the end.
Let's take a closer look at these "energyvoice" articles and how the contents start and see if we can eliminate these redundancies.
energyvoice_articles = articles_df[articles_df["domain"].str.contains("energyvoice")]
energyvoice_articles.content.map(lambda x: x[:50]).value_counts()
| content | count |
|---|---|
| ['', '', 'The Megawatt Hour is the latest podcast | 6 |
| ['Two years after the Amazon Pledge Fund invested | 3 |
| ['A group of trade associations from across the en | 3 |
| ['Bruno Roche, Global Head of Energy Transition, A | 2 |
| ['The cost of clean hydrogen will fall to that of | 2 |
| ... | ... |
| ['Researchers have found an alternative way to ext | 1 |
| ['Controversy has been sparked over plans to build | 1 |
| ['Systems change consultancy Systemiq has released | 1 |
| ['ResponsibleSteel has launched a revised version | 1 |
| ['With Prime Minister Keir Starmer hosting EU lead | 1 |
790 rows × 1 columns
def remove_prefix_articles(df: pd.DataFrame, prefix_len: int = 100) -> pd.DataFrame:
    """
    Takes O(n^2) time complexity
    Two articles share a prefix if their first {prefix_len} characters are the same.
    If an article shares its prefix with a longer article that has the same title, then we remove it.
    If an article shares its prefix with a longer article, but they have different titles, then we keep them.
    """
    # Work on a copy so that the caller's DataFrame (possibly a slice) is not modified
    df = df.copy()
    df["char_len"] = df["content"].map(len)
    df = df.sort_values(by='char_len', ascending=True).reset_index(drop=True)
    # Initialize a list to keep the articles that are not prefixes of others
    non_prefix_articles = []
    for i, row in df.iterrows():
        is_prefix = False
        content_i = row['content'][:prefix_len]
        title_i = row['title']
        for j in range(i + 1, len(df)):
            content_j = df.at[j, 'content'][:prefix_len]
            title_j = df.at[j, 'title']
            # A matching prefix only counts as a duplicate when the titles match too;
            # if the prefix matches but the titles are different, we keep the article
            if content_i == content_j and title_i == title_j:
                is_prefix = True
                break
        if not is_prefix:
            non_prefix_articles.append(row)
    print(f"Removed {len(df) - len(non_prefix_articles)} prefix articles")
    return pd.DataFrame(non_prefix_articles)
energyvoice_articles = remove_prefix_articles(energyvoice_articles)
energyvoice_articles.content.map(lambda x: x[:100]).value_counts()
Removed 11 prefix articles
| content | count |
|---|---|
| ['', '', 'The Megawatt Hour is the latest podcast boxset brought to you by Energy Voice Out Loud in | 6 |
| ['A group of trade associations from across the energy sector have written to the Chancellor urging | 3 |
| ['Two years after the Amazon Pledge Fund invested in Hippo Harvest, the company is selling its first | 3 |
| ['Nicola Sturgeon will today reveal her government’ s new energy strategy on the future of the North | 2 |
| ['Bruno Roche, Global Head of Energy Transition, ABB Process Automation, Energy Industries, outlines | 2 |
| ... | ... |
| ['Oil and gas giant BP ( LSE: BP) has handed out a contract for work on a key North Sea carbon captu | 1 |
| ['The jacket that supported the Ninian Northern platform for decades has completed its trip from the | 1 |
| ['Over the next ten years, the Global Wind Energy Council predicts that more than 380GW of offshore | 1 |
| ['An abrupt management shakeup at CHC has raised questions over turbulence at the North Sea helicopt | 1 |
| ['The results of Scotland’ s first offshore wind leasing round in more than a decade are expected to | 1 |
791 rows × 1 columns
There still seems to be some redundancy, but we did manage to remove 11 duplicates.
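The quadratic loop is fine for a few hundred articles, but the same rule can be expressed as a single pass over the data. The sketch below is an assumption about how this could be done, not the notebook's implementation: it works on plain dictionaries with hypothetical `title` and `content` keys, groups articles by the pair (title, content prefix) and keeps only the longest article in each group. Apart from tie-breaking between articles of equal length, this should select the same survivors as the nested loop.

```python
def keep_longest_per_prefix(articles: list[dict], prefix_len: int = 100) -> list[dict]:
    """Keep, for every (title, content prefix) pair, only the longest article."""
    longest: dict[tuple[str, str], dict] = {}
    for article in articles:
        key = (article["title"], article["content"][:prefix_len])
        current = longest.get(key)
        # Replace the stored article when this one is at least as long
        if current is None or len(article["content"]) >= len(current["content"]):
            longest[key] = article
    return list(longest.values())

# Toy data (hypothetical): the second article extends the first one,
# the third shares the content but has a different title and is kept
toy_articles = [
    {"title": "Wind", "content": "Dutch wind power will grow."},
    {"title": "Wind", "content": "Dutch wind power will grow. Bids close in March."},
    {"title": "Solar", "content": "Dutch wind power will grow."},
]
kept = keep_longest_per_prefix(toy_articles, prefix_len=20)
print([(a["title"], len(a["content"])) for a in kept])
```

Because each article is visited once and looked up in a dictionary, the run time grows roughly linearly with the number of articles.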
Article Contents¶
Having explored various aspects of our dataset, we now turn our attention to the heart of the matter: the article content itself. This section will delve into the analysis and preprocessing techniques we'll employ to ensure the content is high-quality and effectively utilized by our RAG pipeline.
We start with a visual inspection of the article content.
np.random.seed(7)
random_sample_id = np.random.choice(articles_df.index)
print(wrap_text(articles_df.loc[random_sample_id, "content"]))
['Axis Energy, Juniper Green Energy, ReNew, and Acme have emerged as winners in NTPC’ s auction for 1.5 GW of wind-solar hybrid projects connected to India’ s interstate transmission system ( ISTS). The average price was INR 3.30 ( $ 0.040) /kWh.', 'NTPC has concluded its ( Tranche-IV) tender for ISTS-connected wind-solar hybrid capacity, with an average tariff of INR 3.30/kWh.', 'ABC Cleantech ( Axis Energy) won the biggest portion of 750 MW by quoting the lowest tariff of INR 3.27/kWh. Juniper Green Energy secured 300 MW at INR 3.29/kWh. The rest of the capacity was allocated to ReNew ( 300 MW) and ACME Cleantech ( 150 MW) at INR 3.32/kWh.', 'The winning developers will set up the projects on a build-own-operate ( BOO) basis. The projects can be located anywhere in India and must connect to the ISTS.', 'This content is protected by copyright and may not be reused. If you want to cooperate with us and would like to reuse some of our content, please contact: editors @ pv-magazine.com.', 'Please be mindful of our community standards.', 'Your email address will not be published. Required fields are marked *', 'Save my name, email, and website in this browser for the next time I comment.', 'By submitting this form you agree to pv magazine using your data for the purposes of publishing your comment.', 'Your personal data will only be disclosed or otherwise transmitted to third parties for the purposes of spam filtering or if this is necessary for technical maintenance of the website. Any other transfer to third parties will not take place unless this is justified on the basis of applicable data protection regulations or if pv magazine is legally obliged to do so.', 'You may revoke this consent at any time with effect for the future, in which case your personal data will be deleted immediately. 
Otherwise, your data will be deleted if pv magazine has processed your request or the purpose of data storage is fulfilled.', 'Further information on data privacy can be found in our Data Protection Policy.', 'This website uses cookies to anonymously count visitor numbers. View our privacy policy. ×', "The cookie settings on this website are set to `` allow cookies '' to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click `` Accept '' below then you are consenting to this."]
Our initial examination reveals that article content is currently stored as a list of strings. To gain deeper understanding and facilitate preprocessing, we'll transform these lists into a more cohesive textual format.
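import ast
# literal_eval parses the string-encoded list without executing arbitrary code
articles_df['article'] = articles_df['content'].apply(lambda x: ' '.join(ast.literal_eval(x)))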
print(wrap_text(articles_df.loc[random_sample_id, "article"]))
Axis Energy, Juniper Green Energy, ReNew, and Acme have emerged as winners in NTPC’ s auction for 1.5 GW of wind-solar hybrid projects connected to India’ s interstate transmission system ( ISTS). The average price was INR 3.30 ( $ 0.040) /kWh. NTPC has concluded its ( Tranche-IV) tender for ISTS-connected wind-solar hybrid capacity, with an average tariff of INR 3.30/kWh. ABC Cleantech ( Axis Energy) won the biggest portion of 750 MW by quoting the lowest tariff of INR 3.27/kWh. Juniper Green Energy secured 300 MW at INR 3.29/kWh. The rest of the capacity was allocated to ReNew ( 300 MW) and ACME Cleantech ( 150 MW) at INR 3.32/kWh. The winning developers will set up the projects on a build-own-operate ( BOO) basis. The projects can be located anywhere in India and must connect to the ISTS. This content is protected by copyright and may not be reused. If you want to cooperate with us and would like to reuse some of our content, please contact: editors @ pv-magazine.com. Please be mindful of our community standards. Your email address will not be published. Required fields are marked * Save my name, email, and website in this browser for the next time I comment. By submitting this form you agree to pv magazine using your data for the purposes of publishing your comment. Your personal data will only be disclosed or otherwise transmitted to third parties for the purposes of spam filtering or if this is necessary for technical maintenance of the website. Any other transfer to third parties will not take place unless this is justified on the basis of applicable data protection regulations or if pv magazine is legally obliged to do so. You may revoke this consent at any time with effect for the future, in which case your personal data will be deleted immediately. Otherwise, your data will be deleted if pv magazine has processed your request or the purpose of data storage is fulfilled. Further information on data privacy can be found in our Data Protection Policy. 
This website uses cookies to anonymously count visitor numbers. View our privacy policy. × The cookie settings on this website are set to `` allow cookies '' to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click `` Accept '' below then you are consenting to this.
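The conversion from a string-encoded list to a single text can be illustrated in isolation. This is a minimal sketch, assuming every `content` value is the textual representation of a Python list of strings; the helper name `join_paragraphs` and the shortened sample are illustrative. It relies on `ast.literal_eval`, which accepts only literals and therefore rejects anything that would require executing code.

```python
import ast

def join_paragraphs(raw: str) -> str:
    """Parse a string-encoded list of paragraphs and join them with spaces."""
    paragraphs = ast.literal_eval(raw)  # literals only, nothing is executed
    if not isinstance(paragraphs, list):
        raise ValueError("expected a list of paragraph strings")
    return " ".join(str(p) for p in paragraphs)

# Shortened, hypothetical sample in the same format as the content column
raw_content = "['NTPC has concluded its tender.', 'The average tariff was INR 3.30/kWh.']"
article_text = join_paragraphs(raw_content)
print(article_text)
```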
articles_df["article"].duplicated().sum()
43
duplicate_articles = articles_df[articles_df["article"].duplicated(keep=False)].sort_values("article")
duplicate_articles
| | title | content | domain | url | article |
|---|---|---|---|---|---|
| 92379 | Solar Plant Monitoring ( Energy Monitoring) Articles | ["2023-01-10 00:00For buyers of energy meters, the price, quality, functional characteristics, a... | energy-xprt | https://www.energy-xprt.com/energy-monitoring/solar-plant-monitoring/articles | 2023-01-10 00:00For buyers of energy meters, the price, quality, functional characteristics, app... |
| 89712 | Solar Plant Monitoring ( Solar Energy) Articles | ["2023-01-10 00:00For buyers of energy meters, the price, quality, functional characteristics, a... | energy-xprt | https://www.energy-xprt.com/solar-energy/solar-plant-monitoring/articles | 2023-01-10 00:00For buyers of energy meters, the price, quality, functional characteristics, app... |
| 89587 | Backup Power ( Power Distribution) News | ["A new report shows how California's premier public interest clean energy research and developm... | energy-xprt | https://www.energy-xprt.com/power-distribution/backup-power/news | A new report shows how California's premier public interest clean energy research and developmen... |
| 91830 | Backup Power ( Energy Management) News | ["A new report shows how California's premier public interest clean energy research and developm... | energy-xprt | https://www.energy-xprt.com/energy-management/backup-power/news | A new report shows how California's premier public interest clean energy research and developmen... |
| 36043 | Green Prophet - Page 10 of 692 - Sustainability news for the Middle East | ['A poetic look at climate change, drought and a celebrated Irish poet, James Joyce.', 'You want... | greenprophet | https://www.greenprophet.com/page/10/ | A poetic look at climate change, drought and a celebrated Irish poet, James Joyce. You want to i... |
| ... | ... | ... | ... | ... | ... |
| 13364 | Mosaic Materials, Inc.: Quotes, Address, Contact | ['We use cookies to enhance your experience. By continuing to browse this site you agree to our ... | azocleantech | https://www.azocleantech.com/Suppliers.aspx?SupplierID=1870 | We use cookies to enhance your experience. By continuing to browse this site you agree to our us... |
| 100453 | Will new Portuguese administration reduce EU funding for renewables? – pv magazine International | ['With ambitious decarbonization targets and a favourable regulatory landscape, Portugal is an a... | pv-magazine | https://www.pv-magazine.com/2024/04/09/will-new-portuguese-administration-reduce-eu-funding-for-... | With ambitious decarbonization targets and a favourable regulatory landscape, Portugal is an app... |
| 100455 | How can policy help Portugal decarbonize? – pv magazine International | ['With ambitious decarbonization targets and a favourable regulatory landscape, Portugal is an a... | pv-magazine | https://www.pv-magazine.com/2024/04/09/will-new-portuguese-administration-reduce-eu-funding-for-... | With ambitious decarbonization targets and a favourable regulatory landscape, Portugal is an app... |
| 92336 | Renewable Energy Storage ( Energy Storage) Articles | ["YTL, as an company of electricity meter, this Kenya exhibition brought us some Sentiment. The ... | energy-xprt | https://www.energy-xprt.com/energy-storage/renewable-energy-storage/articles | YTL, as an company of electricity meter, this Kenya exhibition brought us some Sentiment. The th... |
| 89383 | Renewable Energy Storage ( Renewable Energy) Articles | ["YTL, as an company of electricity meter, this Kenya exhibition brought us some Sentiment. The ... | energy-xprt | https://www.energy-xprt.com/renewable-energy/renewable-energy-storage/articles | YTL, as an company of electricity meter, this Kenya exhibition brought us some Sentiment. The th... |
85 rows × 5 columns
Our analysis uncovers additional insights regarding content duplication. We observe cases where seemingly identical articles are reposted on the same domain but with different titles (excluding the "sgvoice.energyvoice.com" vs. "energyvoice.com" scenario previously addressed). Here, we'll strategically keep these duplicates where contents are the same but titles are different.
Importance of Titles
We keep these duplicate articles because titles can hold information relevant to our RAG pipeline. Consider a scenario where a user query uses an abbreviation that appears only in the article's title, while the content always uses the full term. To bridge this gap, we'll prepend titles to the article content during preprocessing. This ensures that the retrieval process considers not only the content itself, but also the potentially informative titles.
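Prepending the title can be sketched as a small helper. The function name `build_document`, the blank-line separator and the sample strings are assumptions made for illustration; the actual preprocessing may join the two parts differently.

```python
def build_document(title: str, article: str) -> str:
    """Prepend the title so that terms appearing only in the title become searchable."""
    title = title.strip()
    article = article.strip()
    if not title:
        return article
    # Assumption: a blank line separates the title from the body
    return f"{title}\n\n{article}"

# Hypothetical example: the abbreviation "ISTS" occurs only in the title
doc = build_document(
    "NTPC concludes ISTS hybrid tender",
    "The tender covered wind-solar projects connected to the interstate transmission system.",
)
print(doc)
```

A query containing only the abbreviation can now match this document, even though the body uses the full term.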
Next Step
As previously noted, some articles exhibit standardized introductions, possibly artifacts of the data scraping process. We'll develop appropriate techniques to handle these introductions during preprocessing, ensuring they don't hinder the effectiveness of our RAG pipeline.
articles_df.article.map(lambda x: x[:50]).value_counts()
| article | count |
|---|---|
| By clicking `` Allow All '' you agree to the stori | 1365 |
| We use cookies to enhance your experience. By cont | 865 |
| Sign in to get the best natural gas news and data. | 295 |
| Create a free IEA account to download our reports | 173 |
| window.dojoRequire ( [ `` mojo/signup-forms/Loader | 31 |
| ... | ... |
| My family used to say I had an obsession with Tesl | 1 |
| Poland as part of a second phase of offshore wind | 1 |
| EOIs are being accepted by KenGen for land leases | 1 |
| Avangrid has achieved commercial operation on the | 1 |
| The perovskite-silicon tandem device has a two ter | 1 |
15021 rows × 1 columns
artifacts = [
    "By clicking `` Allow All '' you agree to the sto",
    "Sign in to get the best natural gas news and dat",
    "window.dojoRequire ( [ `` mojo/signup-forms/Load"
]
for artifact in artifacts:
    print(wrap_text(articles_df[articles_df.article.str.startswith(artifact)].article.iloc[0][:500]))
    print()
By clicking `` Allow All '' you agree to the storing of cookies on your device to enhance site
navigation, analyse site usage and support us in providing free open access scientific content.
More info. Waga Energy is joining forces with Steuben County ( New York, USA) on a Renewable
Natural Gas ( RNG) project at the county’ s Bath landfill. The RNG produced will be injected in the
local grid and used as a biofuel to decarbonize mobility. The Steuben County landfill will be the
first in the US to
Sign in to get the best natural gas news and data. Follow the topics you want and receive the daily
emails. Your email address * Your password * Remember me Continue Reset password Featured Content
News & Data Services Client Support Daily GPI Infrastructure | NGI All News Access Electric
transmission planners with the Eastern Interconnection, a major power grid serving two-thirds of
the United States and Canada, recently gave a favorable review in their assessment of how well
regional plans mes
window.dojoRequire ( [ `` mojo/signup-forms/Loader '' ], function ( L) { L.start ( { `` baseUrl '':
'' mc.us4.list-manage.com '', '' uuid '': '' 2a6df7ce0f3230ba1f5efe12c '', '' lid '': '' 1e23cc3ebd
'', '' uniqueMethods '': true }) }) The project will contribute an estimated 3.7 percent of energy
to Azerbaijan’ s total national grid capacity, powering 300,000 households The Azerbaijani Ministry
of Energy and ACWA Power of Saudi Arabia held a groundbreaking ceremony for the 240 MW wind power
pla
def remove_scrapping_artifacts(df: pd.DataFrame, column: str) -> pd.DataFrame:
    text_artifacts = [
        "By clicking `` Allow All '' you agree to the storing of cookies on your device to enhance site navigation, analyse site usage and support us in providing free open access scientific content. More info.",
        "Sign in to get the best natural gas news and data. Follow the topics you want and receive the daily emails. Your email address * Your password * Remember me Continue Reset password Featured Content News & Data Services Client Support"
    ]
    regex_artifacts = [
        # escaped dot and non-greedy match, so only the script snippet itself is removed
        r"window\.dojoRequire \( \[ .*?\}\) \}\) "
    ]
    # operate on the DataFrame passed in, not on the global articles_df
    for pattern in text_artifacts:
        df[column] = df[column].str.replace(pattern, '', regex=False)
    for pattern in regex_artifacts:
        df[column] = df[column].str.replace(pattern, '', regex=True)
    return df
articles_df = remove_scrapping_artifacts(articles_df, "article")
articles_df.article.map(lambda x: x[:50]).value_counts()
| article | count |
|---|---|
| We use cookies to enhance your experience. By cont | 865 |
| Create a free IEA account to download our reports | 173 |
| Welcome to Edinburgh Instruments’ monthly blog cel | 26 |
| Hydrogen Technology Expo & Carbon Capture Technolo | 22 |
| Over the last year Kipp & Zonen has received a lot | 21 |
| ... | ... |
| If an electric car charges while driving, the siz | 1 |
| A Brazilian study sets the stage for increased ef | 1 |
| The Windhager PuroWIN is the first wood chip boile | 1 |
| Envetec Sustainable Technologies Limited ( `` Env | 1 |
| The perovskite-silicon tandem device has a two ter | 1 |
16569 rows × 1 columns
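The regular expression for the embedded sign-up script can be checked in isolation on the sample printed earlier. This is a small sketch using only the standard `re` module; the pattern escapes the dot and uses a non-greedy quantifier so that the match stops at the first closing `}) }) `, and the sample text is shortened after the first sentence fragment.

```python
import re

DOJO_PATTERN = re.compile(r"window\.dojoRequire \( \[ .*?\}\) \}\) ")

# Sample taken from the printed artifact, truncated after the script snippet
sample = (
    "window.dojoRequire ( [ `` mojo/signup-forms/Loader '' ], function ( L) { L.start ( "
    "{ `` baseUrl '': '' mc.us4.list-manage.com '', '' uuid '': '' 2a6df7ce0f3230ba1f5efe12c '', "
    "'' lid '': '' 1e23cc3ebd '', '' uniqueMethods '': true }) }) "
    "The project will contribute an estimated 3.7 percent of energy"
)
cleaned = DOJO_PATTERN.sub("", sample)
print(cleaned)
```

Testing a pattern on one known sample like this is a cheap way to confirm that it removes the artifact without touching the article text that follows.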
Our efforts have eliminated a substantial portion of the scraping artifacts within the articles. However, some traces persist, likely remnants of website navigation structures. Removing them could refine the data further, but doing so reliably is a significant challenge. We therefore accept the remaining artifacts for now and move on to further preprocessing, starting with filtering out articles that are not written in English.
articles_df["lang"] = articles_df["article"].map(detect)
articles_df["lang"].value_counts()
| lang | count |
|---|---|
| en | 20106 |
| de | 3 |
| ru | 1 |
| es | 1 |
Before filtering, we first inspect the articles that were not detected as English.
articles_df[articles_df["lang"] != "en"]
| | title | content | domain | url | article | lang |
|---|---|---|---|---|---|---|
| 82317 | Open Letter to Presidents Putin, Biden, Zelenskyy and Lukashenko - EUROSOLAR | ['EUROSOLAR, the European Association for Renewable Energy, calls for an immediate Climate Cease... | eurosolar | https://www.eurosolar.de/sektionen/russland/ | EUROSOLAR, the European Association for Renewable Energy, calls for an immediate Climate Cease F... | ru |
| 126129 | SMS group liefert Prozesstechnologie für das erste klimaneutrale Stahlwerk weltweit -- Expo for ... | ['© SMS group liefert Prozesstechnologie für das klimaneutrale Stahlwerk in Schweden. ( Quelle: ... | decarbxpo | https://www.decarbxpo.com/en/News_Media/Magazine/Stories/SMS_group_liefert_Prozesstechnologie_fü... | © SMS group liefert Prozesstechnologie für das klimaneutrale Stahlwerk in Schweden. ( Quelle: SM... | de |
| 82320 | Internationale Konferenz für Energiespeicher mit Erneuerbaren Energien ( International Renewable... | ['Die nun zu Ende gegangene „ Internationale Erneuerbare Energiespeicher Konferenz “ ( IRES), wi... | eurosolar | https://www.eurosolar.de/2022/09/26/internationale-konferenz-fuer-energiespeicher-mit-erneuerbar... | Die nun zu Ende gegangene „ Internationale Erneuerbare Energiespeicher Konferenz “ ( IRES), widm... | de |
| 82321 | Presentations, Poster and Photos of the IRES 2022 | ['Photos from the IRES ( Copyright EUROSOLAR e.V.)', 'Molten Salt Electrolyte in Na-ZnCl2 Solid-... | eurosolar | https://www.eurosolar.de/2022/10/20/presentations-poster-and-photos-of-the-ires-2022/ | Photos from the IRES ( Copyright EUROSOLAR e.V.) Molten Salt Electrolyte in Na-ZnCl2 Solid-Elect... | de |
| 101246 | Solar + Storage España 2024 – pv magazine International | ['The event will feature a conference, workshops, and an exhibition dedicated to Distributed Gen... | pv-magazine | https://www.pv-magazine.com/pv-magazine-events/sse24/ | The event will feature a conference, workshops, and an exhibition dedicated to Distributed Gener... | es |
print(wrap_text(articles_df[articles_df["lang"] != "en"].iloc[1]["article"][1000:]))
ne bedeutende CO2-Reduzierung über die gesamte Prozesskette erreichen und uns damit von unseren Marktbegleitern differenzieren. Wir haben uns bei diesem Projekt für SMS group entschieden, weil uns ihre Expertise überzeugt hat, die sie in zahlreichen Projekten weltweit unter Beweis gestellt haben. “ „ Wir sind sehr stolz darauf, die Technologie für das erste komplett klimaneutrale Stahlwerk der Welt zu liefern “, sagte Burkhard Dahmen, Vorstandsvorsitzender der SMS group. „ Dies ist nicht nur ein wichtiger Schritt für H2 Green Steel, sondern auch eine ausgezeichnete Gelegenheit, unsere Kompetenz und unsere Mission der grünen Stahlerzeugung zu unterstreichen. “ SMS group wird eine Midrex®-Direktreduktionsanlage, ein Elektrostahlwerk, eine CSP® Nexus-Gieß-Walzanlage sowie einen fortschrittlichen Kaltwalz- und Bandanlagen-Komplex für die Produktion eines breiten Produktmixes einschließlich hochfester Stahlgüten ( AHSS) und Stahlgüten für die Automobilindustrie liefern. Führende Automobilhersteller haben bereits mit H2 Green Steel Vereinbarungen über die Lieferung von „ grünem “, hochwertigem Bandstahl unterzeichnet. Das Gesamtauftragsvolumen für SMS group liegt bei über 1 Milliarde Euro. Der Standort für H2 Green Steel wird ein rund 300 Hektar großes Greenfield-Gelände in Boden in der schwedischen Region Norbotten sein. Die Anlage wird voraussichtlich ab 2025 grünen Stahl produzieren und das Produktionsvolumen in 2026 ausbauen.Burkhard Dahmen: „ Grüner Stahl auf Wasserstoffbasis ist die Zukunft der Primärstahlerzeugung. Wir alle arbeiten mit Hochdruck daran, die Schlüsseltechnologien zu liefern, und so ein neues Zeitalter der Stahlerzeugung zu einzuleiten. Wir freuen uns, die Partnerschaft mit dem Team von H2 Green Steel fortzusetzen und dieses revolutionäre Projekt gemeinsam zu verwirklichen. “ To use the full function of this web site, JavaScript needs to be enabled in your browser. 
This is how you enable JavaScript in your browser settings: Read instruction To use the full function of this web site, JavaScript needs to be enabled in your browser.
articles_df = articles_df[articles_df["lang"] == "en"]
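Language detectors can fail on inputs that contain no usable text, such as empty strings or strings made only of digits, and missing values are not strings at all. The wrapper below is a hedged sketch of one way to make a `map(detect)` call robust. The names `safe_detect` and `toy_detector` are hypothetical, and the toy detector is only a stand-in that keeps the example runnable without third-party packages; in the notebook, the real `detect` function would be passed instead.

```python
from typing import Callable

def safe_detect(text: object, detector: Callable[[str], str], fallback: str = "unknown") -> str:
    """Run a language detector, returning `fallback` for empty input or detector errors."""
    if not isinstance(text, str) or not text.strip():
        return fallback
    try:
        return detector(text)
    except Exception:  # detectors may raise when the text has no usable features
        return fallback

def toy_detector(text: str) -> str:
    """Hypothetical stand-in for a real detector, used only to make the sketch runnable."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("no features in text")
    return "de" if "ü" in text or "ß" in text else "en"

labels = [safe_detect(t, toy_detector) for t in ["Grüner Stahl", "Green steel", "12345", None]]
print(labels)
```

Rows labelled with the fallback value can then be reviewed or dropped explicitly instead of interrupting the whole run.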
Patent Dataset¶
Here we focus on preprocessing the patent dataset, which has a completely different format and structure from the media dataset.
Removing duplicates: the raw data contains a large number of duplicate rows, many of which differ only in cpc_code or inventor.
# Keep one row per combination of the identifying fields below, ignoring differences in `cpc_code` or `inventor`
print(f"Number of rows before: {len(patents_df)}")
patents_df_unique = patents_df.drop_duplicates(subset=["publication_number","application_number","country_code", "title", "abstract", "publication_date"])
print(f"Number of rows after: {len(patents_df_unique)}")
Number of rows before: 406857
Number of rows after: 68125
publication_number_counts = patents_df_unique.groupby('publication_number').size().sort_values(ascending=False)
print(publication_number_counts)
publication_number
WO-2024027062-A1 10
WO-2024131301-A1 10
WO-2022183304-A1 10
EP-4052298-A1 8
WO-2022265333-A1 7
..
US-11828147-B2 1
US-11828138-B2 1
US-11824484-B2 1
US-11824363-B2 1
ZA-202309063-B 1
Length: 31366, dtype: int64
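Even after deduplication, some publication numbers still occur up to 10 times. Because the deduplication key contains more fields than the publication number, rows that share a publication number must differ in at least one of the other key fields. One plausible cause, which is an assumption and not verified here, is that the same publication is listed with titles or abstracts in several languages. The sketch below illustrates the effect of the key choice on invented toy records; the helper `drop_duplicate_records` and all values are hypothetical.

```python
def drop_duplicate_records(records: list[dict], key_fields: tuple[str, ...]) -> list[dict]:
    """Keep the first record for every distinct combination of key_fields."""
    seen: set[tuple] = set()
    unique = []
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

# Toy records (hypothetical): same publication, two CPC codes, two title languages
toy_patents = [
    {"publication_number": "XX-1-A1", "title": "Solar panel", "cpc_code": "H02S"},
    {"publication_number": "XX-1-A1", "title": "Solar panel", "cpc_code": "Y02E"},
    {"publication_number": "XX-1-A1", "title": "Panneau solaire", "cpc_code": "H02S"},
]
by_number_and_title = drop_duplicate_records(toy_patents, ("publication_number", "title"))
by_number_only = drop_duplicate_records(toy_patents, ("publication_number",))
print(len(by_number_and_title), len(by_number_only))
```

With the composite key, the two language variants survive as separate rows, which matches the pattern of repeated publication numbers seen in the output.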
Language detection and filtering: we detect the language of each title and abstract and keep only the rows where both are English.
# patents_df_unique.to_csv(bronze_folder / "cleantech_patents_with_lang.csv", index=False)
output_path = bronze_folder / "cleantech_patents_with_lang.csv"
if output_path.exists():
    patents_df_unique = pd.read_csv(output_path)
else:
    print("dataset not found. Running language detection...")
    # Detect language of title and abstract
    patents_df_unique["title_lang"] = patents_df_unique["title"].apply(detect)
    patents_df_unique["abstract_lang"] = patents_df_unique["abstract"].apply(detect)
    # Save full dataset
    patents_df_unique.to_csv(output_path, index=False)
    print(f"Saved full dataset with language detection: {output_path.name}")
# Filter only English rows
patents_df_unique_en = patents_df_unique[
    (patents_df_unique["title_lang"] == "en") &
    (patents_df_unique["abstract_lang"] == "en")
]
Patent Topic modelling¶
The media dataset comes with a domain for each article, but the patent dataset defines no such grouping, so we use topic modelling to group the patents.
patents_df_unique_en.describe()
| | publication_date |
|---|---|
| count | 2.868500e+04 |
| mean | 2.022829e+07 |
| std | 7.448034e+03 |
| min | 2.022010e+07 |
| 25% | 2.022080e+07 |
| 50% | 2.023032e+07 |
| 75% | 2.023111e+07 |
| max | 2.024090e+07 |
patents_df_unique_en.info()
<class 'pandas.core.frame.DataFrame'>
Index: 28685 entries, 1 to 68123
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   publication_number  28685 non-null  object
 1   application_number  28685 non-null  object
 2   country_code        28685 non-null  object
 3   title               28685 non-null  object
 4   abstract            28685 non-null  object
 5   publication_date    28685 non-null  int64
 6   inventor            28685 non-null  object
 7   cpc_code            28685 non-null  object
 8   title_lang          28685 non-null  object
 9   abstract_lang       28685 non-null  object
dtypes: int64(1), object(9)
memory usage: 2.4+ MB
Preprocess Patent Data¶
We explore the dataset and remove the fields that are not needed.
# Work on a copy of the English-only patents
patents_df = patents_df_unique_en.copy()
# Combine title + abstract
patents_df["text"] = patents_df["title"].fillna("") + ". " + patents_df["abstract"].fillna("")
# Basic preprocessing
stop_words = set(stopwords.words("english"))
def preprocess_text(text, apply_stemming=False):
    """
    Preprocess a given text by:
    - Converting to lowercase
    - Tokenizing
    - Removing punctuation and stop words
    - Applying lemmatization (and optionally stemming)
    Returns the cleaned tokens joined into a single string.
    """
    if not isinstance(text, str):
        # Return an empty string so that the result type is always str
        return ""
    text = text.lower()
    tokens = word_tokenize(text)
    tokens = [token for token in tokens if token not in string.punctuation]
    tokens = [token for token in tokens if token not in stop_words]
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    if apply_stemming:
        tokens = [stemmer.stem(token) for token in tokens]
    # Join the tokens back into a single string
    return " ".join(tokens)
patents_df["cleaned_text"] = patents_df["text"].apply(preprocess_text)
Vectorize with TF-IDF¶
vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf = vectorizer.fit_transform(patents_df["cleaned_text"])
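To make the two filtering parameters concrete: `max_df=0.95` discards terms that occur in more than 95% of the documents, and `min_df=2` discards terms that occur in fewer than two documents. The sketch below reimplements the idea in plain Python on four toy documents. It is an illustration only, written under the assumption that the defaults follow the smoothed idf `ln((1 + n) / (1 + df)) + 1` with L2-normalised rows; the function name `tfidf` and the toy documents are made up for this example.

```python
import math
from collections import Counter


def tfidf(docs: list[str], min_df: int = 2,
          max_df: float = 0.95) -> tuple[list[str], list[list[float]]]:
    """Toy TF-IDF with document-frequency filtering and L2-normalised rows."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Keep terms that are neither too rare (min_df) nor too common (max_df)
    vocab = sorted(t for t, c in df.items() if c >= min_df and c / n_docs <= max_df)
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw term count times smoothed idf
        row = [counts[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows


docs = [
    "solar panel frame",
    "solar heat pump",
    "wind turbine blade",
    "wind turbine generator solar",
]
vocab, matrix = tfidf(docs)
print(vocab)
for row in matrix:
    print([round(v, 3) for v in row])
```

Terms such as "panel" or "blade" appear in only one toy document and are therefore removed by `min_df=2`, which is the same mechanism that removes one-off spellings and rare tokens from the patent vocabulary.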
Extract Topics with NMF¶
We extract only 10 topics, labelled 0 to 9.
n_topics = 10
nmf_model = NMF(n_components=n_topics, random_state=42)
nmf_features = nmf_model.fit_transform(tfidf)
# Show top keywords per topic
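feature_names = vectorizer.get_feature_names_out()
for i, topic in enumerate(nmf_model.components_):
    # argsort is ascending, so the strongest keyword of each topic is listed last
    top_keywords = [feature_names[index] for index in topic.argsort()[-10:]]
    print(f"Topic {i}: {', '.join(top_keywords)}")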
Topic 0: connected, end, body, fixedly, panel, frame, arranged, solar, rod, plate
Topic 1: storage, temperature, exchange, solar, geothermal, air, energy, pump, heating, heat
Topic 2: device, charging, battery, supply, electric, solar, generation, storage, energy, power
Topic 3: air, energy, fan, speed, generation, turbine, power, generator, blade, wind
Topic 4: carbon, liquid, cell, electrolysis, gas, energy, fuel, storage, production, hydrogen
Topic 5: pipeline, valve, heating, outlet, device, inlet, pump, pipe, tank, water
Topic 6: method, electrode, preparation, material, substrate, film, silicon, solar, cell, layer
Topic 7: roof, cleaning, frame, assembly, angle, generation, support, board, panel, photovoltaic
Topic 8: time, monitoring, based, step, scheduling, operation, method, data, hydropower, station
Topic 9: management, connected, acquisition, supply, detection, used, data, monitoring, control, module
patents_df["topic"] = nmf_features.argmax(axis=1)
# Show topic distribution
topic_distribution = patents_df["topic"].value_counts().sort_index()
print("\nTopic distribution:")
print(topic_distribution)
Topic distribution:
topic
0    4927
1    3653
2    3139
3    2680
4    1868
5    2734
6    3154
7    2151
8    3042
9    1337
Name: count, dtype: int64
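Assigning each patent to the topic with the largest NMF weight is a hard assignment: a document that mixes several topics is still counted under exactly one of them. The sketch below shows how the dominant topic and its share of the total weight could be inspected. `dominant_topic` and the toy weights are hypothetical; in the notebook the rows would come from `nmf_features`.

```python
def dominant_topic(weights: list[float]) -> tuple[int, float]:
    """Return (topic index, share of total weight) for one document."""
    total = sum(weights)
    if total == 0:
        # All-zero rows carry no topic signal; an argmax would still report topic 0
        return -1, 0.0
    best = max(range(len(weights)), key=weights.__getitem__)
    return best, weights[best] / total


# Toy document-topic weights (rows: documents, columns: topics)
toy_features = [
    [0.02, 0.40, 0.01],  # clearly topic 1
    [0.20, 0.18, 0.19],  # mixed: topic 0 wins only narrowly
    [0.00, 0.00, 0.00],  # no signal at all
]
for row in toy_features:
    print(dominant_topic(row))
```

A low share for the winning topic is a hint that the label should be read with care when results are later broken down by topic.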
colors = plt.cm.viridis(np.linspace(0, 1, len(topic_distribution))) # Use viridis colormap
plt.bar(topic_distribution.index, topic_distribution.values, color=colors)
plt.xlabel("Topic Number")
plt.ylabel("Number of Documents")
plt.title("Topic Distribution")
plt.show()
Patent Dates¶
patents_df["publication_date"] = pd.to_datetime(patents_df["publication_date"].astype(str), errors='coerce') # Use errors='coerce' to handle invalid dates
# Similar to articles_df date handling:
patents_time_df = patents_df.groupby("publication_date").size().reset_index()
patents_time_df.columns = ["publication_date", "count"]
patents_time_df.describe()
| | publication_date | count |
|---|---|---|
| count | 626 | 626.000000 |
| mean | 2023-04-27 16:10:44.089456896 | 45.822684 |
| min | 2022-01-01 00:00:00 | 1.000000 |
| 25% | 2022-08-27 00:00:00 | 3.000000 |
| 50% | 2023-04-26 12:00:00 | 13.000000 |
| 75% | 2023-12-25 12:00:00 | 94.000000 |
| max | 2024-09-05 00:00:00 | 241.000000 |
| std | NaN | 52.190349 |
sns.lineplot(data=patents_time_df, x="publication_date", y="count")
plt.title("Patent Article Count Over Time")
plt.xlabel("Publication Date")
plt.xticks(rotation=90)
plt.ylabel("Patent Article Count")
# add a line for the average
avg_count = patents_time_df["count"].mean()
plt.axhline(avg_count, color='r', linestyle='--', label=f"Average article count per day: {avg_count:.2f}")
plt.legend()
plt.show()
We remove the publication date, as it has no relevance for our task and also contains 52 NaN values.
patents_df = patents_df.drop(columns=["publication_date"])
Patent CPC code and inventor¶
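import ast
from collections import defaultdict
inventor_counts = defaultdict(int)
for inventor_list in patents_df['inventor']:
    # After a CSV round trip the lists are stored as strings such as "['A', 'B']";
    # parse them so that we count inventor names rather than single characters
    if isinstance(inventor_list, str):
        inventor_list = ast.literal_eval(inventor_list)
    if inventor_list:
        for inventor in inventor_list:
            inventor_counts[inventor] += 1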
# Convert the dictionary to a DataFrame
patents_new_df = pd.DataFrame(inventor_counts.items(), columns=['inventor', 'count'])
patents_new_df.describe()
| | count |
|---|---|
| count | 113.000000 |
| mean | 12711.584071 |
| std | 32320.692165 |
| min | 1.000000 |
| 25% | 4.000000 |
| 50% | 353.000000 |
| 75% | 8257.000000 |
| max | 196197.000000 |
patents_cpc_df = patents_df.groupby("cpc_code").size().reset_index()
patents_cpc_df.columns = ["cpc_code", "count"]
patents_cpc_df.describe()
| | count |
|---|---|
| count | 6045.000000 |
| mean | 4.745244 |
| std | 47.672810 |
| min | 1.000000 |
| 25% | 1.000000 |
| 50% | 1.000000 |
| 75% | 2.000000 |
| max | 3200.000000 |
We also remove the `cpc_code` and `inventor` columns, which are not needed for our task.
patents_df = patents_df.drop(columns=["cpc_code", "inventor"])
Patent Titles¶
Here we found out that titles are not unique: several patents share the same title, although their abstracts can differ.
sns.histplot(patents_df["title"].str.len())
plt.title("Title Length Distribution")
plt.xlabel("Title Length")
plt.ylabel("Count")
avg_count = patents_df["title"].str.len().mean()
plt.axvline(avg_count, color='r', linestyle='--', label=f"Average title length: {avg_count:.2f}")
plt.legend()
plt.show()
patents_df["title"].duplicated().sum()
3404
duplicate_titles = patents_df[patents_df["title"].duplicated(keep=False)].sort_values("title")
duplicate_titles.head(10)
| | publication_number | application_number | country_code | title | abstract | title_lang | abstract_lang | text | cleaned_text | topic |
|---|---|---|---|---|---|---|---|---|---|---|
| 36155 | CN-216591460-U | CN-202123174925-U | CN | 5G communication wisdom street lamp | The utility model relates to the technical field of intelligent street lamps and discloses a 5G ... | en | en | 5G communication wisdom street lamp. The utility model relates to the technical field of intelli... | 5g communication wisdom street lamp utility model relates technical field intelligent street lam... | 0 |
| 61408 | CN-216143672-U | CN-202122011678-U | CN | 5G communication wisdom street lamp | The utility model relates to a street lamp, in particular to a 5G communication intelligent stre... | en | en | 5G communication wisdom street lamp. The utility model relates to a street lamp, in particular t... | 5g communication wisdom street lamp utility model relates street lamp particular 5g communicatio... | 0 |
| 21089 | CN-217164845-U | CN-202220079258-U | CN | A breaker for hydraulic and hydroelectric engineering | The utility model belongs to the technical field of a crushing device, in particular to a crushi... | en | en | A breaker for hydraulic and hydroelectric engineering. The utility model belongs to the technica... | breaker hydraulic hydroelectric engineering utility model belongs technical field crushing devic... | 0 |
| 23321 | CN-216936150-U | CN-202220328747-U | CN | A breaker for hydraulic and hydroelectric engineering | The utility model provides a crusher for water conservancy and hydropower engineering, which rel... | en | en | A breaker for hydraulic and hydroelectric engineering. The utility model provides a crusher for ... | breaker hydraulic hydroelectric engineering utility model provides crusher water conservancy hyd... | 0 |
| 35109 | CN-216654721-U | CN-202122764875-U | CN | A breaker for hydraulic and hydroelectric engineering | The utility model discloses a crusher for water conservancy and hydropower engineering, which co... | en | en | A breaker for hydraulic and hydroelectric engineering. The utility model discloses a crusher for... | breaker hydraulic hydroelectric engineering utility model discloses crusher water conservancy hy... | 0 |
| 9925 | CN-218833988-U | CN-202223011290-U | CN | A dust device for hydraulic and hydroelectric engineering construction | The utility model discloses a dust device for hydraulic and hydroelectric engineering constructi... | en | en | A dust device for hydraulic and hydroelectric engineering construction. The utility model disclo... | dust device hydraulic hydroelectric engineering construction utility model discloses dust device... | 5 |
| 35110 | CN-216653908-U | CN-202220115116-U | CN | A dust device for hydraulic and hydroelectric engineering construction | The utility model relates to a dust-settling device for water conservancy and hydropower enginee... | en | en | A dust device for hydraulic and hydroelectric engineering construction. The utility model relate... | dust device hydraulic hydroelectric engineering construction utility model relates dust-settling... | 5 |
| 8485 | CN-114506947-B | CN-202210069053-A | CN | A filtration system for hydraulic and hydroelectric engineering | The invention belongs to the technical field of filtering systems, and discloses a filtering sys... | en | en | A filtration system for hydraulic and hydroelectric engineering. The invention belongs to the te... | filtration system hydraulic hydroelectric engineering invention belongs technical field filterin... | 0 |
| 36485 | CN-114506947-A | CN-202210069053-A | CN | A filtration system for hydraulic and hydroelectric engineering | The invention belongs to the technical field of filter systems, and discloses a filter system fo... | en | en | A filtration system for hydraulic and hydroelectric engineering. The invention belongs to the te... | filtration system hydraulic hydroelectric engineering invention belongs technical field filter s... | 0 |
| 19335 | CN-114955614-A | CN-202210693867-A | CN | A industrial dust collector for feed bin top | The invention relates to the technical field of industrial dust collectors, in particular to an ... | en | en | A industrial dust collector for feed bin top. The invention relates to the technical field of in... | industrial dust collector feed bin top invention relates technical field industrial dust collect... | 0 |
duplicate_titles["abstract"].duplicated().sum()
1365
For better readability, we wrap long texts before printing them.
def wrap_text(text: str, char_per_line: int = 100) -> str:
    # Wrap the text at the last space before char_per_line
    if len(text) < char_per_line:
        return text
    head = text[:char_per_line].rsplit(' ', 1)[0]
    return head + '\n' + wrap_text(text[len(head) + 1:], char_per_line)
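As an alternative to a hand-written helper, Python's standard library offers `textwrap.fill`, which wraps at whitespace and also copes with words longer than the line width. The snippet below is only a sketch of that option with a shortened sample text; it is not what the notebook uses.

```python
import textwrap

sample = (
    "The utility model relates to the technical field of intelligent street lamps "
    "and discloses a 5G communication intelligent street lamp which comprises a lamp post."
)
# Break the text into lines of at most 60 characters
wrapped = textwrap.fill(sample, width=60)
print(wrapped)
```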
print(duplicate_titles.iloc[0]["title"])
print(wrap_text(duplicate_titles.iloc[0]["abstract"]))
5G communication wisdom street lamp
The utility model relates to the technical field of intelligent street lamps and discloses a 5G communication intelligent street lamp which comprises a lamp post, wherein a first illuminating lamp is installed at the upper end of the left side of the lamp post, a second illuminating lamp is installed at the upper end of the right side of the lamp post, the height of the second illuminating lamp is larger than that of the first illuminating lamp, solar panels are installed at the left end and the right end of the upper side of the lamp post, a charging box is installed at the bottom side of the lamp post, and an electric quantity display screen is arranged at the front side of the charging box. According to the utility model, a large amount of solar energy collected by the solar panel is converted into electric energy to be stored in the storage battery in the charging box, so that not only can power be supplied to a plurality of groups of electric appliances on the lamp post, but also the electric vehicle can be charged, the trouble of places without charging of the electric vehicle is greatly reduced, meanwhile, the charging box is provided with the emergency button, the camera can be driven to photograph in real time and is transmitted to the monitoring station through the internet, the illegal copies can be captured in the first time, thus, the illegal activities of some illegal copies can be prevented, and the crime rate is reduced.
print(duplicate_titles.iloc[1]["title"])
print(wrap_text(duplicate_titles.iloc[1]["abstract"]))
5G communication wisdom street lamp
The utility model relates to a street lamp, in particular to a 5G communication intelligent street lamp. The utility model provides a 5G communication intelligent street lamp capable of tracking the angle of the sun. A5G communication intelligent street lamp comprises a lamp post, a fixed seat, a bolt, a lamp holder, an adjustable illuminating lamp, a monitoring device and the like; the lamp pole lower extreme fixedly connected with fixing base wears to be equipped with a plurality of bolts on the fixing base, and lamp pole upper portion fixedly connected with lamp stand, lamp stand downside are equipped with adjustable light, and lamp pole upper portion is located lamp stand upside fixedly connected with mount, fixedly connected with monitoring devices on the mount, and the lamp pole upper end is equipped with solar energy power supply mechanism. According to the solar energy power generation device, the light sensing device senses the change of the position of the sun, the stepping motor is controlled to drive the solar power generation panel to adjust the angle of the solar power generation panel on the horizontal plane, and the motor is controlled to drive the solar power generation panel to adjust the angle of the solar power generation panel on the vertical plane, so that the solar energy resources can be fully utilized.
Patents Abstract¶
As we can see here, patent abstracts are much shorter than the media articles.
np.random.seed(7)
random_sample_id = np.random.choice(patents_df.index)
print(wrap_text(patents_df.loc[random_sample_id, "abstract"]))
The invention discloses a photovoltaic inverter monitoring control system and method based on an acquisition terminal, which are mainly applied to a low-voltage distribution area in the field of power systems. The system mainly comprises: the system comprises a collection terminal, a carrier-to-Modbus converter, a photovoltaic inverter and distributed photovoltaics, and is characterized in that the systems can communicate with each other through a power line and can realize local control of the photovoltaic inverter; according to the method, a collection terminal carries out limit value judgment after receiving a limit value parameter issued by a main station, periodically collected photovoltaic inverter operation parameters are compared with the limit value parameter, the collection terminal carries out prediction judgment after receiving a prediction parameter issued by the main station, the counted photovoltaic power generation capacity of the photovoltaic inverter is compared with the predicted power generation capacity, control is executed according to a control strategy finally, and related control events are reported to the main station.
patents_df_unique.head()
| | publication_number | application_number | country_code | title | abstract | publication_date | inventor | cpc_code | title_lang | abstract_lang |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CN-117138249-A | CN-202311356270-A | CN | 一种石墨烯光疗面罩 | The application provides a graphene phototherapy mask, and relates to the technical field of pho... | 20231201 | ['LI HAITAO', 'CAO WENQIANG'] | A61N2005/0654 | zh-cn | en |
| 1 | CN-117151396-A | CN-202311109834-A | CN | Distributed economic scheduling method for wind, solar, biogas and hydrogen multi-energy multi-m... | The invention discloses a distributed economic dispatching method of a wind, solar and methane h... | 20231201 | ['HU PENGFEI', 'LI ZIMENG'] | G06Q50/06 | en | en |
| 2 | CN-117141530-A | CN-202310980795-A | CN | 氢能源动力轨道车辆组 | The invention discloses a hydrogen energy power rail vehicle group, which comprises a power vehi... | 20231201 | ['XIE BO', 'ZHANG SHUIQING', 'ZHOU FEI', 'LIU YONG', 'Zhou Houyi'] | Y02T90/40 | zh-cn | en |
| 3 | CN-117141244-A | CN-202311177651-A | CN | 一种汽车太阳能充电系统、方法及新能源汽车 | The application discloses an automobile solar charging system, an automobile solar charging meth... | 20231201 | ['ZHAO PENGCHENG'] | B60K16/00 | ko | en |
| 4 | CN-117146094-A | CN-202311272549-A | CN | 一种水利水电管道连接装置 | The invention provides a water conservancy and hydropower pipeline connecting device, which effe... | 20231201 | ['LYU SHUOSHUO', 'LI PANFENG', 'XU ZHENGWEI', 'WANG WEIBIN', 'ZHANG CHEN', 'ZHOU HAIYUN'] | F16L55/02 | zh-cn | en |
Below is the language distribution in the patent dataset. We will only consider patents in English.
# Count detected languages in title
title_lang_counts = patents_df_unique["title_lang"].value_counts()
print("Title Language Counts:")
print(title_lang_counts)
# Count detected languages in abstract
abstract_lang_counts = patents_df_unique["abstract_lang"].value_counts()
print("\nAbstract Language Counts:")
print(abstract_lang_counts)
Title Language Counts:
title_lang
en       35123
zh-cn    21467
ko        6583
fr        1404
da         895
it         449
no         446
de         326
ro         269
es         257
nl         166
ca         127
cy         113
sv         100
af          95
ru          68
vi          42
tl          32
ar          22
ja          21
id          17
et          15
pt          15
tr          13
cs          11
pl          10
sl           9
fi           7
el           5
sk           5
lt           4
hr           3
so           2
lv           1
uk           1
sw           1
sq           1
Name: count, dtype: int64

Abstract Language Counts:
abstract_lang
en       61152
zh-cn     6274
fr         290
ko         283
de          51
ja          17
ar          16
es          16
ru           9
pl           6
cs           2
ro           2
hr           2
pt           2
lv           1
fi           1
sl           1
Name: count, dtype: int64
Our exploration revealed a small number of articles containing non-English content (some in German and 1 with a Russian section). Since most LLMs and embedding models are primarily trained on English text, removing these articles ensures compatibility with our chosen models for this notebook. For simplicity, we'll only focus on supporting English queries and responses within this RAG pipeline.
Challenges of Multilingual RAG Pipelines¶
Introducing multilingual capabilities into a RAG pipeline presents an additional layer of complexity. Here's a breakdown of some key challenges:
- Multilingual Model Support: Both the LLM and embedding models need to be proficient in all target languages (e.g., English and German). The LLM must be able to comprehend and generate text in these languages, while the embedding models should effectively map similar concepts across languages into the same semantic space.
- Prompt Engineering for Multilingual Responses: When a user submits a question in German, for instance, we'd ideally retrieve relevant articles, potentially also in English which can distract the LLM, and utilize prompt engineering to ensure the LLM generates a response in German.
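The second point can be illustrated with a small prompt template that states the answer language explicitly. This is a sketch only: `build_prompt` and the wording of the instructions are hypothetical and not taken from the reference implementation used later in this notebook.

```python
def build_prompt(question: str, contexts: list[str], answer_language: str) -> str:
    """Assemble a RAG prompt that pins the language of the answer."""
    # Number the retrieved passages so the model can tell them apart
    context_block = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return (
        "Answer the question using only the context below.\n"
        f"Always answer in {answer_language}, even if the context is in another language.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    question="Wie funktioniert eine Wärmepumpe?",
    contexts=["A heat pump transfers heat from a cold reservoir to a warm one."],
    answer_language="German",
)
print(prompt)
```

Whether such an instruction is followed reliably depends on the LLM, which is one reason this notebook restricts itself to English.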
Characters, Tokens and Words¶
Let us further analyze the contents of the articles. However, before we do so let us define the meaning of characters, tokens and words:
- Characters: The smallest unit of text, including letters, numbers, punctuation, and whitespace.
- Tokens: Most NLP models operate on tokens, which are sequences of characters that represent a semantic unit. These units can be words, subwords, or characters. Tokenization is the process of converting text into tokens. To see the tokenization process in action for the OpenAI GPT-4 model, check out the OpenAI GPT-4 Tokenizer.
- Words: Just as in everyday language, words are the building blocks of text. They are composed of one or more characters and are separated by whitespace.
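The three units give different counts for the same text. The example below counts them for one sentence, using a whitespace split for words and a simple regular expression as a stand-in tokenizer. The regex is an assumption made for illustration; the spaCy tokenizer used in the following cells and the OpenAI tokenizer split text differently.

```python
import re

sentence = "The EU's net-zero plan cuts CO2 emissions."

n_chars = len(sentence)
# Words: split on whitespace
words = sentence.split()
# Toy tokenizer: runs of letters/digits, or single punctuation marks
tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(n_chars, len(words), len(tokens))
print(tokens)
```

Note how the possessive "EU's" yields a separate "s" token under this toy tokenizer, so there are more tokens than words.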
Media Dataset¶
sns.histplot(articles_df["article"].map(len), kde=True)
plt.title("Amount of characters in articles")
plt.xlabel("Amount of characters")
plt.ylabel("Number of articles")
median_char_len = articles_df["article"].map(len).median()
mean_char_len = articles_df["article"].map(len).mean()
plt.axvline(median_char_len, color='r', linestyle='--', label=f"Median character amount: {median_char_len:.2f}")
plt.axvline(mean_char_len, color='g', linestyle='--', label=f"Mean character amount: {mean_char_len:.2f}")
plt.legend()
plt.show()
sns.histplot(articles_df["article"].map(lambda x: len(x.split())), kde=True)
plt.title("Amount of words in articles")
plt.xlabel("Amount of words")
plt.ylabel("Number of articles")
median_word_len = articles_df["article"].map(lambda x: len(x.split())).median()
mean_word_len = articles_df["article"].map(lambda x: len(x.split())).mean()
plt.axvline(median_word_len, color='r', linestyle='--', label=f"Median word amount: {median_word_len:.2f}")
plt.axvline(mean_word_len, color='g', linestyle='--', label=f"Mean word amount: {mean_word_len:.2f}")
plt.legend()
plt.show()
nlp = English()
tokenizer = nlp.tokenizer
sns.histplot(articles_df["article"].map(lambda x: len(tokenizer(x))), kde=True)
plt.title("Amount of tokens in articles")
plt.xlabel("Amount of tokens")
plt.ylabel("Number of articles")
median_token_len = articles_df["article"].map(lambda x: len(tokenizer(x))).median()
mean_token_len = articles_df["article"].map(lambda x: len(tokenizer(x))).mean()
plt.axvline(median_token_len, color='r', linestyle='--', label=f"Median token amount: {median_token_len:.2f}")
plt.axvline(mean_token_len, color='g', linestyle='--', label=f"Mean token amount: {mean_token_len:.2f}")
plt.legend()
plt.show()
all_tokens = [token.text for article in articles_df["article"] for token in tokenizer(article)]
# remove non-alphabetic tokens such as punctuation
alpha_tokens = [token for token in all_tokens if token.isalpha()]
alpha_tokens = [token.lower() for token in alpha_tokens]
alpha_token_counts = Counter(alpha_tokens)
sns.barplot(
x=[count for token, count in alpha_token_counts.most_common(20)],
y=[token for token, count in alpha_token_counts.most_common(20)],
hue=[token for token, count in alpha_token_counts.most_common(20)]
)
plt.title("Most common alphabetic tokens")
plt.xlabel("Count")
plt.ylabel("Token")
plt.show()
The initial approach returns common words which do not reflect the subject-specific nature of our document collection. We will remove them to understand the content of the texts better.
# remove stopwords such as 'the', 'a', 'and'
non_stop_tokens = [token for token in alpha_tokens if not nlp.vocab[token].is_stop]
non_stop_token_counts = Counter(non_stop_tokens)
sns.barplot(
x=[count for token, count in non_stop_token_counts.most_common(20)],
y=[token for token, count in non_stop_token_counts.most_common(20)],
hue=[token for token, count in non_stop_token_counts.most_common(20)]
)
plt.title("Most common non-stopword tokens")
plt.xlabel("Count")
plt.ylabel("Token")
plt.show()
As one would expect in a dataset of cleantech news articles, most of the tokens that are not punctuation or stopwords revolve around the subjects of energy, climate, and technology. This is a good sign that the dataset is relevant to the topic at hand. The "s" token comes up frequently, which is likely due to the possessive form of words. With an average of around 700 words per article, we can expect a good amount of information to be present in each article and an average reading time of around 3-4 minutes.
Flesch Reading Ease Score¶
The Flesch Reading Ease Score (FRES), one of the Flesch-Kincaid readability tests, is a heuristic used to evaluate how easy it is to understand a text based on the average sentence length and the average number of syllables per word. Scores typically range from 0 (very difficult to read) to 100 (very easy to read), although the formula is not bounded, so texts with very long sentences can score below 0. Scores below 50 indicate difficult texts at college level. This metric can be useful for assessing the readability of our articles and ensuring they are accessible to a broad audience.
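For intuition, the score is computed as 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words). The sketch below implements this formula with a deliberately naive syllable counter (groups of consecutive vowels). `count_syllables` and `flesch_reading_ease_sketch` are hypothetical helper names, and the `flesch_reading_ease` function used in the next cell counts syllables more carefully, so its numbers will differ.

```python
import re


def count_syllables(word: str) -> int:
    """Naive heuristic: count groups of consecutive vowels (at least one per word)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease_sketch(text: str) -> float:
    """FRES = 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * len(words) / len(sentences) - 84.6 * syllables / len(words)


easy = "The sun is hot. We use it for power."
hard = ("The photovoltaic inverter monitoring control system periodically compares "
        "collected operational parameters with predetermined limitation parameters")
print(round(flesch_reading_ease_sketch(easy), 1))
print(round(flesch_reading_ease_sketch(hard), 1))
```

Short sentences with short words score high, while a single long sentence full of polysyllabic technical terms can push the score below zero.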
articles_df["readability"] = articles_df["article"].apply(flesch_reading_ease)
sns.histplot(articles_df["readability"], kde=True)
plt.title("Flesch Reading Ease of articles")
plt.xlabel("Flesch Reading Ease")
plt.ylabel("Number of articles")
mean_readability = articles_df["readability"].mean()
plt.axvline(mean_readability, color='g', linestyle='--', label=f"Mean readability: {mean_readability:.2f}")
plt.legend()
plt.show()
We now analyze how language complexity varies across the publishing domains.
domains = articles_df["domain"].unique()
# Setup the subplots based on the number of domains
plots_per_row = 3
num_rows = (len(domains) + 2) // plots_per_row
plot_height = 6
fig, axes = plt.subplots(num_rows, plots_per_row, figsize=(plot_height * plots_per_row, plot_height * num_rows))
axes = axes.flatten() # Flatten the axes array for easier iteration
# Plot for each domain
for i, domain in enumerate(domains):
    domain_articles = articles_df[articles_df["domain"] == domain]
    sns.histplot(domain_articles["readability"], kde=True, ax=axes[i], bins=30)
    axes[i].set_title(f'Readability of {domain}')
    axes[i].set_xlabel('Flesch Reading Ease Score')
    axes[i].set_ylabel("Number of articles")
    mean_readability = domain_articles["readability"].mean()
    axes[i].axvline(mean_readability, color='g', linestyle='--', label=f"Mean readability: {mean_readability:.2f}")
    axes[i].legend()  # display the label of the mean line
# remove the empty plots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])
plt.tight_layout()
plt.show()
To gauge the readability of our articles, we calculated the Flesch Reading Ease Score. The average score of around 45 corresponds to a "difficult" (college-level) reading level on the Flesch scale, which is plausible for technical news. The content should still be accessible to an educated audience and, consequently, understandable by our RAG pipeline as well.
Our analysis revealed a consistent average Flesch Reading Ease Score across most of the identified domains, with minor variations. This indicates a relatively consistent level of readability across different publishers within the dataset.
Finally we will save the cleaned dataset to a new file in the data/silver folder.
silver_folder = data_folder / "silver"
# Create the folder if it does not exist yet
silver_folder.mkdir(parents=True, exist_ok=True)
articles_df.to_csv(silver_folder / "articles.csv", index=False)
Patent Dataset¶
sns.histplot(patents_df["abstract"].map(len), kde=True)
plt.title("Amount of characters in abstracts")
plt.xlabel("Amount of characters")
plt.ylabel("Number of Abstracts")
median_char_len = patents_df["abstract"].map(len).median()
mean_char_len = patents_df["abstract"].map(len).mean()
plt.axvline(median_char_len, color='r', linestyle='--', label=f"Median character amount: {median_char_len:.2f}")
plt.axvline(mean_char_len, color='g', linestyle='--', label=f"Mean character amount: {mean_char_len:.2f}")
plt.legend()
plt.show()
patents_df[patents_df["abstract"].map(len) <50]
| | publication_number | application_number | country_code | title | abstract | title_lang | abstract_lang | text | cleaned_text | topic |
|---|---|---|---|---|---|---|---|---|---|---|
| 37368 | US-11326019-B1 | US-202117531006-A | US | Fused dithieno benzothiadiazole polymers for organic photovoltaics | A method to produce | en | en | Fused dithieno benzothiadiazole polymers for organic photovoltaics. A method to produce | fused dithieno benzothiadiazole polymer organic photovoltaics method produce | 6 |
| 54119 | WO-2023091151-A1 | US-2021060331-W | WO | Fused dithieno benzothiadiazole polymers for organic photovoltaics | A method to produce Formula (I). | en | en | Fused dithieno benzothiadiazole polymers for organic photovoltaics. A method to produce Formula ... | fused dithieno benzothiadiazole polymer organic photovoltaics method produce formula | 6 |
sns.histplot(patents_df["abstract"].map(lambda x: len(x.split())), kde=True)
plt.title("Amount of words in abstracts")
plt.xlabel("Amount of words")
plt.ylabel("Number of abstracts")
median_word_len = patents_df["abstract"].map(lambda x: len(x.split())).median()
mean_word_len = patents_df["abstract"].map(lambda x: len(x.split())).mean()
plt.axvline(median_word_len, color='r', linestyle='--', label=f"Median word amount: {median_word_len:.2f}")
plt.axvline(mean_word_len, color='g', linestyle='--', label=f"Mean word amount: {mean_word_len:.2f}")
plt.legend()
plt.show()
nlp = English()
tokenizer = nlp.tokenizer
sns.histplot(patents_df["abstract"].map(lambda x: len(tokenizer(x))), kde=True)
plt.title("Amount of tokens in abstracts")
plt.xlabel("Amount of tokens")
plt.ylabel("Number of abstracts")
median_token_len = patents_df["abstract"].map(lambda x: len(tokenizer(x))).median()
mean_token_len = patents_df["abstract"].map(lambda x: len(tokenizer(x))).mean()
plt.axvline(median_token_len, color='r', linestyle='--', label=f"Median token amount: {median_token_len:.2f}")
plt.axvline(mean_token_len, color='g', linestyle='--', label=f"Mean token amount: {mean_token_len:.2f}")
plt.legend()
plt.show()
all_tokens = [token.text for article in patents_df["abstract"] for token in tokenizer(article)]
# remove non-alphabetic tokens such as punctuation
alpha_tokens = [token for token in all_tokens if token.isalpha()]
alpha_tokens = [token.lower() for token in alpha_tokens]
alpha_token_counts = Counter(alpha_tokens)
sns.barplot(
x=[count for token, count in alpha_token_counts.most_common(20)],
y=[token for token, count in alpha_token_counts.most_common(20)],
hue=[token for token, count in alpha_token_counts.most_common(20)]
)
plt.title("Most common alphabetic tokens")
plt.xlabel("Count")
plt.ylabel("Token")
plt.show()
# remove stopwords such as 'the', 'a', 'and'
non_stop_tokens = [token for token in alpha_tokens if not nlp.vocab[token].is_stop]
non_stop_token_counts = Counter(non_stop_tokens)
sns.barplot(
x=[count for token, count in non_stop_token_counts.most_common(20)],
y=[token for token, count in non_stop_token_counts.most_common(20)],
hue=[token for token, count in non_stop_token_counts.most_common(20)]
)
plt.title("Most common non-stopword tokens")
plt.xlabel("Count")
plt.ylabel("Token")
plt.show()
The plots show that the patent abstracts contain many stopwords and a lower frequency of technical words. Still, the most common words in the patent dataset are largely the same as in the media dataset.
Flesch Reading Ease Score¶
The plot below shows poor results. Flesch Reading Ease scores usually lie between 0 and 100, and higher is better. Here, however, some patent abstracts score below 0, with values ranging from -200 to -100. The media dataset clearly has better readability.
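Negative values are possible because the formula has no lower bound: every additional word in a sentence subtracts about one point. As an illustration with assumed numbers (not measured on this dataset), take an abstract written as a single sentence of 150 words with an average of 1.9 syllables per word:

$$
\text{FRES} = 206.835 - 1.015 \cdot \frac{150}{1} - 84.6 \cdot 1.9 = 206.835 - 152.25 - 160.74 = -106.155
$$

Patent abstracts are often written as one or two very long sentences, which is consistent with the strongly negative scores seen in the plot.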
patents_df["readability"] = patents_df["abstract"].apply(flesch_reading_ease)
sns.histplot(patents_df["readability"], kde=True)
plt.title("Flesch Reading Ease of abstracts")
plt.xlabel("Flesch Reading Ease")
plt.ylabel("Number of abstracts")
mean_readability = patents_df["readability"].mean()
plt.axvline(mean_readability, color='g', linestyle='--', label=f"Mean readability: {mean_readability:.2f}")
plt.legend()
plt.show()
Flesch Readability Distribution per Topic¶
The plots show that readability is poor for every topic and not comparable to the media dataset.
domains = patents_df["topic"].unique()
# Setup the subplots based on the number of domains
plots_per_row = 3
num_rows = (len(domains) + 2) // plots_per_row
plot_height = 6
fig, axes = plt.subplots(num_rows, plots_per_row, figsize=(plot_height * plots_per_row, plot_height * num_rows))
axes = axes.flatten() # Flatten the axes array for easier iteration
# Plot for each domain
for i, domain in enumerate(domains):
    domain_articles = patents_df[patents_df["topic"] == domain]
    sns.histplot(domain_articles["readability"], kde=True, ax=axes[i], bins=30)
    axes[i].set_title(f'Readability of topic {domain}')
    axes[i].set_xlabel('Flesch Reading Ease Score')
    axes[i].set_ylabel("Number of abstracts")
    mean_readability = domain_articles["readability"].mean()
    axes[i].axvline(mean_readability, color='g', linestyle='--', label=f"Mean readability: {mean_readability:.2f}")
    axes[i].legend()  # display the label of the mean line
# remove the empty plots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])
plt.tight_layout()
plt.show()
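A numeric summary per topic can complement the histograms when comparing domains. The following is a minimal sketch on made-up data. It only assumes a DataFrame with `topic` and `readability` columns like `patents_df`, and the topic names and scores are hypothetical.

```python
import pandas as pd

# Toy stand-in for patents_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "topic": ["solar", "solar", "wind", "wind"],
    "readability": [10.0, -30.0, 20.0, 40.0],
})

# One row per topic, sorted so the least readable topic comes first.
summary = (
    toy_df.groupby("topic")["readability"]
    .agg(["count", "mean", "median", "min"])
    .sort_values("mean")
)
print(summary)
```

Applied to the real data, the same call would be `patents_df.groupby("topic")["readability"]`.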
silver_folder = data_folder / "silver"
silver_folder.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
patents_df.to_csv(silver_folder / "abstracts.csv", index=False)
Evaluation Data¶
Next, we will analyze the provided evaluation data and ensure that it matches the content of the articles.
human_eval_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   example_id     23 non-null     int64
 1   question_id    23 non-null     int64
 2   question       23 non-null     object
 3   relevant_text  23 non-null     object
 4   answer         23 non-null     object
 5   article_url    23 non-null     object
dtypes: int64(2), object(4)
memory usage: 1.2+ KB
human_eval_df.rename(columns={"relevant_text":"relevant_section","article_url": "url"}, inplace=True)
human_eval_df.drop(columns=["question_id"], inplace=True)
human_eval_df.head()
| | example_id | question | relevant_section | answer | url |
|---|---|---|---|---|---|
| 0 | 1 | What is the innovation behind Leclanché's new method to produce lithium-ion batteries? | Leclanché said it has developed an environmentally friendly way to produce lithium-ion (Li-ion) ... | Leclanché's innovation is using a water-based process instead of highly toxic organic solvents t... | https://www.sgvoice.net/strategy/technology/23971/leclanches-new-disruptive-battery-boosts-energ... |
| 1 | 2 | What is the EU’s Green Deal Industrial Plan? | The Green Deal Industrial Plan is a bid by the EU to make its net zero industry more competitive... | The EU’s Green Deal Industrial Plan aims to enhance the competitiveness of its net zero industry... | https://www.sgvoice.net/policy/25396/eu-seeks-competitive-boost-with-green-deal-industrial-plan/ |
| 2 | 3 | What is the EU’s Green Deal Industrial Plan? | The European counterpart to the US Inflation Reduction Act (IRA) aims to create an environment t... | The EU’s Green Deal Industrial Plan aims to enhance the competitiveness of its net zero industry... | https://www.pv-magazine.com/2023/02/02/european-commission-introduces-green-deal-industrial-plan/ |
| 3 | 4 | What are the four focus areas of the EU's Green Deal Industrial Plan? | The new plan is fundamentally focused on four areas, or pillars: the regulatory environment, acc... | The four focus areas of the EU's Green Deal Industrial Plan are the regulatory environment, acce... | https://www.sgvoice.net/policy/25396/eu-seeks-competitive-boost-with-green-deal-industrial-plan/ |
| 4 | 5 | When did the cooperation between GM and Honda on fuel cell vehicles start? | What caught our eye was a new hookup between GM and Honda. Honda was also hammering away at the ... | July 2013 | https://cleantechnica.com/2023/05/08/general-motors-seizes-the-fuel-cell-moment-with-green-hydro... |
sns.histplot(human_eval_df["question"].map(len), kde=True)
plt.title("Question Character Length Distribution")
plt.xlabel("Character Length")
plt.ylabel("Count")
mean_char_len = human_eval_df["question"].map(len).mean()
plt.axvline(mean_char_len, color='r', linestyle='--', label=f"Mean character amount: {mean_char_len:.2f}")
plt.legend()
plt.show()
missing_articles = human_eval_df.copy()
missing_articles = missing_articles[~human_eval_df["url"].isin(articles_df["url"])]
missing_articles
| | example_id | question | relevant_section | answer | url |
|---|---|---|---|---|---|
| 0 | 1 | What is the innovation behind Leclanché's new method to produce lithium-ion batteries? | Leclanché said it has developed an environmentally friendly way to produce lithium-ion (Li-ion) ... | Leclanché's innovation is using a water-based process instead of highly toxic organic solvents t... | https://www.sgvoice.net/strategy/technology/23971/leclanches-new-disruptive-battery-boosts-energ... |
| 1 | 2 | What is the EU’s Green Deal Industrial Plan? | The Green Deal Industrial Plan is a bid by the EU to make its net zero industry more competitive... | The EU’s Green Deal Industrial Plan aims to enhance the competitiveness of its net zero industry... | https://www.sgvoice.net/policy/25396/eu-seeks-competitive-boost-with-green-deal-industrial-plan/ |
| 3 | 4 | What are the four focus areas of the EU's Green Deal Industrial Plan? | The new plan is fundamentally focused on four areas, or pillars: the regulatory environment, acc... | The four focus areas of the EU's Green Deal Industrial Plan are the regulatory environment, acce... | https://www.sgvoice.net/policy/25396/eu-seeks-competitive-boost-with-green-deal-industrial-plan/ |
| 22 | 23 | Which has the higher absorption coefficient for wavelengths above 500m - amorphous germanium or ... | We chose amorphous germanium instead of amorphous silicon as absorber material because of its hi... | amorphous germanium | https://www.pv-magazine.com/2021/01/15/germanium-based-solar-cell-tech-for-agrivoltaics/#respond |
Our exploration has identified instances where articles linked to specific questions appear to be missing from the dataset. To determine the root cause, let's investigate whether these articles are genuinely absent or if inconsistencies in URL formatting are creating the illusion of missing data. Normalizing the URLs across the dataset will help us differentiate between these two scenarios.
def normalize_url(url: str) -> str:
    # Drop the scheme, the "www." prefix, and any trailing slash
    url = url.replace("https://", "")
    url = url.replace("http://", "")
    url = url.replace("www.", "")
    url = url.rstrip("/")
    return url
articles_df["url"] = articles_df["url"].map(normalize_url)
human_eval_df["url"] = human_eval_df["url"].map(normalize_url)
missing_articles = human_eval_df.copy()
missing_articles = missing_articles[~human_eval_df["url"].isin(articles_df["url"])]
missing_articles
| | example_id | question | relevant_section | answer | url |
|---|---|---|---|---|---|
| 0 | 1 | What is the innovation behind Leclanché's new method to produce lithium-ion batteries? | Leclanché said it has developed an environmentally friendly way to produce lithium-ion (Li-ion) ... | Leclanché's innovation is using a water-based process instead of highly toxic organic solvents t... | sgvoice.net/strategy/technology/23971/leclanches-new-disruptive-battery-boosts-energy-density |
| 1 | 2 | What is the EU’s Green Deal Industrial Plan? | The Green Deal Industrial Plan is a bid by the EU to make its net zero industry more competitive... | The EU’s Green Deal Industrial Plan aims to enhance the competitiveness of its net zero industry... | sgvoice.net/policy/25396/eu-seeks-competitive-boost-with-green-deal-industrial-plan |
| 3 | 4 | What are the four focus areas of the EU's Green Deal Industrial Plan? | The new plan is fundamentally focused on four areas, or pillars: the regulatory environment, acc... | The four focus areas of the EU's Green Deal Industrial Plan are the regulatory environment, acce... | sgvoice.net/policy/25396/eu-seeks-competitive-boost-with-green-deal-industrial-plan |
| 22 | 23 | Which has the higher absorption coefficient for wavelengths above 500m - amorphous germanium or ... | We chose amorphous germanium instead of amorphous silicon as absorber material because of its hi... | amorphous germanium | pv-magazine.com/2021/01/15/germanium-based-solar-cell-tech-for-agrivoltaics/#respond |
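As a quick sanity check, the sketch below applies the same normalization steps to two URLs from the evaluation data. The function is repeated only so the snippet runs on its own. It shows that the scheme, the `www.` prefix, and a trailing slash are removed, while a fragment such as `#respond` is left untouched.

```python
def normalize_url(url: str) -> str:
    # Same steps as the notebook's helper, repeated so this sketch is self-contained.
    url = url.replace("https://", "")
    url = url.replace("http://", "")
    url = url.replace("www.", "")
    url = url.rstrip("/")
    return url


# A URL that ends in a slash: scheme, "www." and the slash are all removed.
with_slash = normalize_url(
    "https://www.sgvoice.net/policy/25396/eu-seeks-competitive-boost-with-green-deal-industrial-plan/"
)

# A URL that ends in a fragment: rstrip("/") has nothing to strip at the end,
# so the "/#respond" suffix survives normalization.
with_fragment = normalize_url(
    "https://www.pv-magazine.com/2021/01/15/germanium-based-solar-cell-tech-for-agrivoltaics/#respond"
)

print(with_slash)
print(with_fragment)
```

Note that `str.replace` removes every occurrence of `www.` in the string, not only a leading one, which is acceptable for the URLs shown here but worth keeping in mind for other data.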
We also know from previous analysis that there are some duplicate articles from the "energyvoice" domain, so we will normalize these URLs as well by mapping `sgvoice.net` to `sgvoice.energyvoice.com`.
missing_articles["url"] = missing_articles["url"].map(lambda x: x.replace("sgvoice.net", "sgvoice.energyvoice.com"))
missing_articles[~missing_articles["url"].isin(articles_df["url"])]
| | example_id | question | relevant_section | answer | url |
|---|---|---|---|---|---|
| 22 | 23 | Which has the higher absorption coefficient for wavelengths above 500m - amorphous germanium or ... | We chose amorphous germanium instead of amorphous silicon as absorber material because of its hi... | amorphous germanium | pv-magazine.com/2021/01/15/germanium-based-solar-cell-tech-for-agrivoltaics/#respond |
human_eval_df.loc[missing_articles.index, "url"] = missing_articles["url"]
human_eval_df[human_eval_df["url"].isin(articles_df["url"])]
| | example_id | question | relevant_section | answer | url |
|---|---|---|---|---|---|
| 0 | 1 | What is the innovation behind Leclanché's new method to produce lithium-ion batteries? | Leclanché said it has developed an environmentally friendly way to produce lithium-ion (Li-ion) ... | Leclanché's innovation is using a water-based process instead of highly toxic organic solvents t... | sgvoice.energyvoice.com/strategy/technology/23971/leclanches-new-disruptive-battery-boosts-energ... |
| 1 | 2 | What is the EU’s Green Deal Industrial Plan? | The Green Deal Industrial Plan is a bid by the EU to make its net zero industry more competitive... | The EU’s Green Deal Industrial Plan aims to enhance the competitiveness of its net zero industry... | sgvoice.energyvoice.com/policy/25396/eu-seeks-competitive-boost-with-green-deal-industrial-plan |
| 2 | 3 | What is the EU’s Green Deal Industrial Plan? | The European counterpart to the US Inflation Reduction Act (IRA) aims to create an environment t... | The EU’s Green Deal Industrial Plan aims to enhance the competitiveness of its net zero industry... | pv-magazine.com/2023/02/02/european-commission-introduces-green-deal-industrial-plan |
| 3 | 4 | What are the four focus areas of the EU's Green Deal Industrial Plan? | The new plan is fundamentally focused on four areas, or pillars: the regulatory environment, acc... | The four focus areas of the EU's Green Deal Industrial Plan are the regulatory environment, acce... | sgvoice.energyvoice.com/policy/25396/eu-seeks-competitive-boost-with-green-deal-industrial-plan |
| 4 | 5 | When did the cooperation between GM and Honda on fuel cell vehicles start? | What caught our eye was a new hookup between GM and Honda. Honda was also hammering away at the ... | July 2013 | cleantechnica.com/2023/05/08/general-motors-seizes-the-fuel-cell-moment-with-green-hydrogen |
| 5 | 6 | Did Colgate-Palmolive enter into PPA agreements with solar developers? | Scout Clean Energy, a Colorado-based renewable energy developer, owner and operator, has signed ... | yes | solarindustrymag.com/scout-and-colgate-palmolive-sign-ppa-for-texas-solar-farm |
| 6 | 7 | What is the status of ZeroAvia's hydrogen fuel cell electric aircraft? | In December, the US startup ZeroAvia announced that its retrofitted 19-seat Dornier 228 hydrogen... | ZeroAvia's hydrogen fuel cell electric aircraft, a retrofitted 19-seat Dornier 228, has received... | cleantechnica.com/2023/01/02/the-wait-for-hydrogen-fuel-cell-electric-aircraft-just-got-shorter/... |
| 7 | 8 | What is the "Danger Season"? | As spring turns to summer and the days warm up, the Northern Hemisphere enters the period known ... | The "Danger Season" is the period in the Northern Hemisphere, beginning in late spring, when wil... | cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate |
| 8 | 9 | Is Mississipi an anti-ESG state? | Mississippi is among two dozen or so states in which Republican governors, legislators, treasure... | yes | cleantechnica.com/2023/05/15/mississippi-takes-green-hydrogen-to-next-level/#zox-comments-button |
| 9 | 10 | Can you hang solar panels on garden fences? | Scaling down from the farm to the garden level, another company is now offering plug & play sola... | yes | cleantechnica.com/2023/05/18/solar-panels-for-garden-fences-plug-play-solar-gone-wild |
| 10 | 11 | Who develops quality control systems for ocean temperature in-situ profiles? | Scientists from the Chinese Academy of Sciences’ (CAS) Institute of Atmospheric Physics (IAP) an... | Scientists from the Chinese Academy of Sciences' Institute of Atmospheric Physics (IAP) | azocleantech.com/news.aspx?newsID=32873 |
| 11 | 12 | Why are milder winters detrimental for grapes and apples? | Since grapes and apples are perennial species, they have adapted to consistent climate patterns ... | Milder winters are detrimental for grapes and apples because these perennial species rely on con... | azocleantech.com/news.aspx?newsID=33040 |
| 12 | 13 | What are the basic recycling steps for solar panels? | There are some simple recycling steps that can be taken to reduce the waste volume, including re... | removing the frames, glass covers, and solar connectors | azocleantech.com/news.aspx?newsID=33143 |
| 13 | 14 | Why does melting ice contribute to global warming? | Whereas white ice reflects the sun's rays, a dark sea absorbs over ten times as much solar energ... | Melting ice contributes to global warming because white ice reflects the sun's rays, while the d... | azocleantech.com/news.aspx?newsID=33149 |
| 14 | 15 | Does the Swedish government plan bans on new petrol and diesel cars? | The Swedish government has proposed a ban on new petrol and diesel cars from 2030 to reduce carb... | yes | azocleantech.com/news.aspx?newsID=33174 |
| 15 | 16 | Where do the turbines used in Icelandic geothermal power plants come from? | Minister Nishimura mentioned that most geothermal power plants in Iceland use turbines made by J... | Japan | thinkgeoenergy.com/japan-and-iceland-agree-on-geothermal-energy-cooperation |
| 16 | 17 | Who is the target user for Leapfrog Energy? | O’Brien added, “Subsurface specialists need flexible and fast tools like Leapfrog Energy to unde... | subsurface specialists | thinkgeoenergy.com/seequent-expands-subsurface-capabilities-with-leapfrog-energy |
| 17 | 18 | What is Agrivoltaics? | Agrivoltaics, the integration of food production and solar energy, is an emerging technology tha... | the integration of food production and solar energy to make better use of limited land and soil ... | pv-magazine.com/2023/03/31/new-software-modeling-tool-for-agrivoltaics/#comments |
| 18 | 19 | What is Agrivoltaics? | Agrivoltaics refers to the conduct of agricultural activity within a solar array. A relatively n... | the integration of food production and solar energy to make better use of limited land and soil ... | cleantechnica.com/2022/12/18/agrivoltaics-goes-nuclear-on-california-prairie |
| 19 | 20 | Why is cannabis cultivation moving indoors? | Cannabis cultivation can take place outdoors, indoors, or in greenhouses. While outdoor cultivat... | to meet the demand for higher-quality products, control environmental factors and flowering peri... | pv-magazine.com/2023/04/08/high-time-for-solar/#comments |
| 20 | 21 | What are the obstacles for cannabis producers when it comes to using solar energy? | “There are a lot of prevailing headwinds for cannabis to adopt more solar,” says Mochulsky. “Acc... | limited access to financial instruments, inability to secure standard loans or mortgages, lack o... | pv-magazine.com/2023/04/08/high-time-for-solar/#comments |
| 21 | 22 | In 2021, what were the top 3 states in the US in terms of total solar power generating capacity? | In 2021, Florida surpassed North Carolina to become third in the nation in total solar power gen... | California, Texas, and Florida | cleantechnica.com/2023/04/10/solar-power-in-florida |
In the end, we can match 22 of the 23 evaluation questions to an article in the dataset; only the question with example_id 23 still has no matching URL after normalization. With that, the exploratory data analysis and preprocessing of the seed evaluation data is complete.
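The match between evaluation rows and articles can also be reported as a count instead of being read off the displayed table. The following is a minimal sketch with made-up URLs. The two small DataFrames are hypothetical stand-ins for `articles_df` and `human_eval_df`.

```python
import pandas as pd

# Hypothetical stand-ins for articles_df and human_eval_df.
articles = pd.DataFrame({"url": ["example.com/a", "example.com/b"]})
eval_df = pd.DataFrame({"url": ["example.com/a", "example.com/b", "example.com/c"]})

# Boolean mask: True where the evaluation URL exists in the article corpus.
matched = eval_df["url"].isin(articles["url"])
unmatched = eval_df.loc[~matched, "url"].tolist()

print(f"{matched.sum()} of {len(eval_df)} evaluation rows have a matching article")
print("Unmatched URLs:", unmatched)
```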
Our Generated Evaluation QA pairs¶
human_eval_df_media.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   id                  200 non-null    int64
 1   context             200 non-null    object
 2   question            200 non-null    object
 3   answer              200 non-null    object
 4   source_doc          200 non-null    object
 5   category            200 non-null    object
 6   groundedness_score  97 non-null     float64
 7   groundedness_eval   97 non-null     object
 8   relevance_score     97 non-null     float64
 9   relevance_eval      97 non-null     object
 10  standalone_score    97 non-null     float64
 11  standalone_eval     97 non-null     object
dtypes: float64(3), int64(1), object(8)
memory usage: 18.9+ KB
human_eval_df_patent.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   id                  200 non-null    int64
 1   context             200 non-null    object
 2   question            200 non-null    object
 3   answer              200 non-null    object
 4   category            200 non-null    object
 5   groundedness_score  33 non-null     float64
 6   groundedness_eval   33 non-null     object
 7   relevance_score     33 non-null     float64
 8   relevance_eval      33 non-null     object
 9   standalone_score    33 non-null     float64
 10  standalone_eval     33 non-null     object
 11  title               200 non-null    object
dtypes: float64(3), int64(1), object(8)
memory usage: 18.9+ KB
Curated media QA Pairs preprocessing¶
We perform the same preprocessing steps as for the seed QA dataset.
human_eval_df_media.rename(columns={"context":"relevant_section","source_doc": "url","Unnamed: 0": "id"}, inplace=True)
human_eval_df_media.head()
| | id | relevant_section | question | answer | url | category | groundedness_score | groundedness_eval | relevance_score | relevance_eval | standalone_score | standalone_eval |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What is green hydrogen, and how is it produced? | Green hydrogen is a sustainable energy carrier produced by water electrolysis using renewable en... | https://www.azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 5.0 | The provided context clearly explains what green hydrogen is, how it is produced through water e... | 3.0 | The question is straightforward and clear, inquiring about a specific concept (green hydrogen) a... | 5.0 | The question is self-explanatory and does not rely on additional context to be understood. It is... |
| 1 | 1 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What is the significance of the Neom project in Saudi Arabia as a pioneering example of green hy... | The Neom project, a partnership between ACWA Power, Air Products, and NEOM, harnesses solar and ... | https://www.azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 5.0 | The context provides a comprehensive overview of the significance of green hydrogen and its inte... | 4.0 | The Neom project is a key concept in the context of green hydrogen integration, and understandin... | 4.0 | The question appears to rely on additional knowledge about the Neom project and its connection t... |
| 2 | 2 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What is the significance of the Energy Transitions Commission's report on making clean electrifi... | The report highlights the need for a 30-year transition to electrify the global economy, providi... | https://www.azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 5.0 | The question regarding the significance of the Energy Transitions Commission's report "Making Cl... | 3.0 | This question appears to be relevant to environmental sustainability and energy policy, which mi... | 5.0 | The question refers to a specific institution (Energy Transitions Commission) and a specific con... |
| 3 | 3 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | How does green hydrogen compare to direct use of electricity in terms of energy efficiency? | Green hydrogen production through electrolysis is less energy-efficient than direct use of elect... | https://www.azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 4.0 | The question requires an in-depth analysis of the context provided, specifically focusing on the... | 4.0 | This question is relevant to NLP developers building applications that may use energy-intensive ... | 5.0 | The question implies that there might be some universal or general information about green hydro... |
| 4 | 4 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What are some of the examples of pilot projects testing the viability of green hydrogen in vario... | Various pilot projects are testing green hydrogen's viability in energy systems, demonstrating i... | https://www.azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 5.0 | The question can be answered unambiguously based on the provided context, which describes variou... | 3.0 | This question appears to be focused on environmental sustainability and energy systems, which is... | 4.0 | The question asks for specific examples of pilot projects, which implies the existence of a cont... |
sns.histplot(human_eval_df_media["question"].map(len), kde=True)
plt.title("Question Character Length Distribution")
plt.xlabel("Character Length")
plt.ylabel("Count")
mean_char_len = human_eval_df_media["question"].map(len).mean()
plt.axvline(mean_char_len, color='r', linestyle='--', label=f"Mean character amount: {mean_char_len:.2f}")
plt.legend()
plt.show()
missing_articles = human_eval_df_media.copy()
missing_articles = missing_articles[~human_eval_df_media["url"].isin(articles_df["url"])]
missing_articles
| | id | relevant_section | question | answer | url | category | groundedness_score | groundedness_eval | relevance_score | relevance_eval | standalone_score | standalone_eval |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What is green hydrogen, and how is it produced? | Green hydrogen is a sustainable energy carrier produced by water electrolysis using renewable en... | https://www.azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 5.0 | The provided context clearly explains what green hydrogen is, how it is produced through water e... | 3.0 | The question is straightforward and clear, inquiring about a specific concept (green hydrogen) a... | 5.0 | The question is self-explanatory and does not rely on additional context to be understood. It is... |
| 1 | 1 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What is the significance of the Neom project in Saudi Arabia as a pioneering example of green hy... | The Neom project, a partnership between ACWA Power, Air Products, and NEOM, harnesses solar and ... | https://www.azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 5.0 | The context provides a comprehensive overview of the significance of green hydrogen and its inte... | 4.0 | The Neom project is a key concept in the context of green hydrogen integration, and understandin... | 4.0 | The question appears to rely on additional knowledge about the Neom project and its connection t... |
| 2 | 2 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What is the significance of the Energy Transitions Commission's report on making clean electrifi... | The report highlights the need for a 30-year transition to electrify the global economy, providi... | https://www.azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 5.0 | The question regarding the significance of the Energy Transitions Commission's report "Making Cl... | 3.0 | This question appears to be relevant to environmental sustainability and energy policy, which mi... | 5.0 | The question refers to a specific institution (Energy Transitions Commission) and a specific con... |
| 3 | 3 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | How does green hydrogen compare to direct use of electricity in terms of energy efficiency? | Green hydrogen production through electrolysis is less energy-efficient than direct use of elect... | https://www.azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 4.0 | The question requires an in-depth analysis of the context provided, specifically focusing on the... | 4.0 | This question is relevant to NLP developers building applications that may use energy-intensive ... | 5.0 | The question implies that there might be some universal or general information about green hydro... |
| 4 | 4 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What are some of the examples of pilot projects testing the viability of green hydrogen in vario... | Various pilot projects are testing green hydrogen's viability in energy systems, demonstrating i... | https://www.azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 5.0 | The question can be answered unambiguously based on the provided context, which describes variou... | 3.0 | This question appears to be focused on environmental sustainability and energy systems, which is... | 4.0 | The question asks for specific examples of pilot projects, which implies the existence of a cont... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | 195 | India’ s installed battery storage capacity reached 219.1 MWh at the end of March 2024. A recent... | Who is the author of the Mercom report on India's energy storage landscape? | The report is authored by Mercom. | https://www.pv-magazine.com/2024/07/10/indias-battery-storage-capacity-hits-219-1-mwh/ | Sustainability & Technological Innovation Questions | NaN | NaN | NaN | NaN | NaN | NaN |
| 196 | 196 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What are the main challenges associated with integrating green hydrogen into the energy grid? | The main challenges include technical hurdles such as lower energy efficiency, safe storage and ... | https://www.azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | NaN | NaN | NaN | NaN | NaN | NaN |
| 197 | 197 | Shortly after Tesla CEO Elon Musk met with India Prime Minister Narendra Modi, news is out that ... | How does the Indian government plan to achieve its electric vehicle adoption goals? | The Indian government aims to achieve its electric vehicle adoption goals through a combination ... | https://cleantechnica.com/2023/07/13/tesla-plans-to-manufacture-a-24000-car-in-india/ | Government & Corporate Initiatives | NaN | NaN | NaN | NaN | NaN | NaN |
| 198 | 198 | NV Energy, Nevada’ s largest public utility, has awarded Energy Vault Holdings Inc. with a proje... | What is Energy Vault's goal for short-duration energy storage solutions? | To be the energy storage company of choice for utilities, IPPs, and large energy users. | https://solarindustrymag.com/energy-vault-deploys-440-mwh-nevada-energy-storage-system-for-nv-en... | Sustainability & Technological Innovation Questions | NaN | NaN | NaN | NaN | NaN | NaN |
| 199 | 199 | The Japanese technology company Asahi Kasei is further accelerating its hydrogen business activi... | What is the name of the company discussed in the interview? | Various companies, including Cleantech for UK under Sarah Mackintosh's leadership, are discussed... | https://www.azocleantech.com/news.aspx?newsID=34871 | Government & Corporate Initiatives | NaN | NaN | NaN | NaN | NaN | NaN |
200 rows × 12 columns
articles_df["url"] = articles_df["url"].map(normalize_url)
human_eval_df_media["url"] = human_eval_df_media["url"].map(normalize_url)
missing_articles = human_eval_df_media.copy()
missing_articles = missing_articles[~human_eval_df_media["url"].isin(articles_df["url"])]
missing_articles
| | id | relevant_section | question | answer | url | category | groundedness_score | groundedness_eval | relevance_score | relevance_eval | standalone_score | standalone_eval |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
missing_articles["url"] = missing_articles["url"].map(lambda x: x.replace("sgvoice.net", "sgvoice.energyvoice.com"))
missing_articles[~missing_articles["url"].isin(articles_df["url"])]
| | id | relevant_section | question | answer | url | category | groundedness_score | groundedness_eval | relevance_score | relevance_eval | standalone_score | standalone_eval |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
After normalizing the URLs, there are no missing articles: every URL in the media QA dataset now matches a URL in the original media dataset.
human_eval_df_media.loc[missing_articles.index, "url"] = missing_articles["url"]
human_eval_df_media[human_eval_df_media["url"].isin(articles_df["url"])]
| | id | relevant_section | question | answer | url | category | groundedness_score | groundedness_eval | relevance_score | relevance_eval | standalone_score | standalone_eval |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What is green hydrogen, and how is it produced? | Green hydrogen is a sustainable energy carrier produced by water electrolysis using renewable en... | azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 5.0 | The provided context clearly explains what green hydrogen is, how it is produced through water e... | 3.0 | The question is straightforward and clear, inquiring about a specific concept (green hydrogen) a... | 5.0 | The question is self-explanatory and does not rely on additional context to be understood. It is... |
| 1 | 1 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What is the significance of the Neom project in Saudi Arabia as a pioneering example of green hy... | The Neom project, a partnership between ACWA Power, Air Products, and NEOM, harnesses solar and ... | azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 5.0 | The context provides a comprehensive overview of the significance of green hydrogen and its inte... | 4.0 | The Neom project is a key concept in the context of green hydrogen integration, and understandin... | 4.0 | The question appears to rely on additional knowledge about the Neom project and its connection t... |
| 2 | 2 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What is the significance of the Energy Transitions Commission's report on making clean electrifi... | The report highlights the need for a 30-year transition to electrify the global economy, providi... | azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 5.0 | The question regarding the significance of the Energy Transitions Commission's report "Making Cl... | 3.0 | This question appears to be relevant to environmental sustainability and energy policy, which mi... | 5.0 | The question refers to a specific institution (Energy Transitions Commission) and a specific con... |
| 3 | 3 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | How does green hydrogen compare to direct use of electricity in terms of energy efficiency? | Green hydrogen production through electrolysis is less energy-efficient than direct use of elect... | azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 4.0 | The question requires an in-depth analysis of the context provided, specifically focusing on the... | 4.0 | This question is relevant to NLP developers building applications that may use energy-intensive ... | 5.0 | The question implies that there might be some universal or general information about green hydro... |
| 4 | 4 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What are some of the examples of pilot projects testing the viability of green hydrogen in vario... | Various pilot projects are testing green hydrogen's viability in energy systems, demonstrating i... | azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 5.0 | The question can be answered unambiguously based on the provided context, which describes variou... | 3.0 | This question appears to be focused on environmental sustainability and energy systems, which is... | 4.0 | The question asks for specific examples of pilot projects, which implies the existence of a cont... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | 195 | India’ s installed battery storage capacity reached 219.1 MWh at the end of March 2024. A recent... | Who is the author of the Mercom report on India's energy storage landscape? | The report is authored by Mercom. | pv-magazine.com/2024/07/10/indias-battery-storage-capacity-hits-219-1-mwh | Sustainability & Technological Innovation Questions | NaN | NaN | NaN | NaN | NaN | NaN |
| 196 | 196 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What are the main challenges associated with integrating green hydrogen into the energy grid? | The main challenges include technical hurdles such as lower energy efficiency, safe storage and ... | azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | NaN | NaN | NaN | NaN | NaN | NaN |
| 197 | 197 | Shortly after Tesla CEO Elon Musk met with India Prime Minister Narendra Modi, news is out that ... | How does the Indian government plan to achieve its electric vehicle adoption goals? | The Indian government aims to achieve its electric vehicle adoption goals through a combination ... | cleantechnica.com/2023/07/13/tesla-plans-to-manufacture-a-24000-car-in-india | Government & Corporate Initiatives | NaN | NaN | NaN | NaN | NaN | NaN |
| 198 | 198 | NV Energy, Nevada’ s largest public utility, has awarded Energy Vault Holdings Inc. with a proje... | What is Energy Vault's goal for short-duration energy storage solutions? | To be the energy storage company of choice for utilities, IPPs, and large energy users. | solarindustrymag.com/energy-vault-deploys-440-mwh-nevada-energy-storage-system-for-nv-energy | Sustainability & Technological Innovation Questions | NaN | NaN | NaN | NaN | NaN | NaN |
| 199 | 199 | The Japanese technology company Asahi Kasei is further accelerating its hydrogen business activi... | What is the name of the company discussed in the interview? | Various companies, including Cleantech for UK under Sarah Mackintosh's leadership, are discussed... | azocleantech.com/news.aspx?newsID=34871 | Government & Corporate Initiatives | NaN | NaN | NaN | NaN | NaN | NaN |
200 rows × 12 columns
Media QA Pairs¶
missing_articles["url"] = missing_articles["url"].map(lambda x: x.replace("sgvoice.net", "sgvoice.energyvoice.com"))
missing_articles[~missing_articles["url"].isin(articles_df["url"])]
| id | relevant_section | question | answer | url | category | groundedness_score | groundedness_eval | relevance_score | relevance_eval | standalone_score | standalone_eval |
|---|---|---|---|---|---|---|---|---|---|---|---|
human_eval_df_patent.rename(columns={"context":"relevant_section"}, inplace=True)
human_eval_df_patent.head()
| | id | relevant_section | question | answer | category | groundedness_score | groundedness_eval | relevance_score | relevance_eval | standalone_score | standalone_eval | title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Distributed photovoltaic energy storage refrigeration house systemThe utility model discloses a ... | How does the system reduce the cost of cold storage? | The system reduces the cost of cold storage by converting solar energy into electric energy, whi... | Sustainability & Technological Innovation Questions | 5.0 | The question can be answered based on the given context, and the answer is clear and unambiguous. | 4.0 | The question is concise and to the point, directly asking about a specific aspect of how the Hug... | 5.0 | The question appears to be related to a general concept of cost reduction in the context of data... | Distributed photovoltaic energy storage refrigeration house system |
| 1 | 1 | Water path manifold structure of hydrogen energy automobile electric pileThe utility model discl... | What is good about the utility model? | The utility model has a simple structure that is easy to assemble and disassemble. | Analytical & Explanatory Questions | 5.0 | The question "What is good about the utility model?" is somewhat ambiguous without further clari... | 3.0 | The question is very short and to the point, but it lacks context and detail about the specific ... | 5.0 | The question seems to be asking about a general property or characteristic of a "utility model",... | Water path manifold structure of hydrogen energy automobile electric pile |
| 2 | 2 | Active power control method of water-fire-wind-solar energy storage multi-energy complementary i... | What is the purpose of using power supply with better regulation performance to compensate for p... | The power supply with better regulation performance is used to carry out compensation regulation... | Government & Corporate Initiatives | 5.0 | The question is clearly answerable by understanding the purpose of using power supply with bette... | 3.0 | The question is directly related to power supply regulation and its impact on the performance of... | 5.0 | The question assumes knowledge of power supplies in general, specifically their regulation perfo... | Active power control method of water-fire-wind-solar energy storage multi-energy complementary i... |
| 3 | 3 | Water conservancy and hydropower engineering construction tunnel internal flow guiding and drain... | What is the water conservancy and hydropower engineering construction hole inner diversion drain... | The utility model discloses a water conservancy and hydropower engineering construction hole inn... | Sustainability & Technological Innovation Questions | 5.0 | The question is clearly answerable with the given context, as it describes a specific inner dive... | 3.0 | The question seems to be about a specific technical term, which may be useful for machine learni... | 5.0 | The question contains technical terms and a specific reference to a concept that appears to be w... | Water conservancy and hydropower engineering construction tunnel internal flow guiding and drain... |
| 4 | 4 | Medium-and-long-term electric power quantity balancing method for electric power system containi... | What is the main consideration for the balancing method, in addition to safety and economy? | The seasonal characteristics of renewable energy sources in time and the coordination problem of... | Analytical & Explanatory Questions | 4.0 | The context provides a detailed description of a method for balancing electric quantity in a pow... | 4.0 | The question is asking about a specific aspect of the balancing method, which is a common techni... | 5.0 | The question does not provide a specific context, and the balancing method is a general concept ... | Medium-and-long-term electric power quantity balancing method for electric power system containi... |
sns.histplot(human_eval_df_patent["question"].map(len), kde=True)
plt.title("Question Character Length Distribution")
plt.xlabel("Character Length")
plt.ylabel("Count")
mean_char_len = human_eval_df_patent["question"].map(len).mean()
plt.axvline(mean_char_len, color='r', linestyle='--', label=f"Mean character amount: {mean_char_len:.2f}")
plt.legend()
plt.show()
Subsampling¶
To speed up processing and reduce the cost of running the notebook, we subsample the dataset to roughly 1000 articles. This lets us run the notebook in a reasonable amount of time while still producing meaningful results. Because the distribution of articles across publishers is skewed, we use stratified sampling to obtain a representative sample. The evaluation data are linked to specific articles, so we also need to make sure that these articles are included in the subsample.
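The idea can be sketched without pandas. The following is a minimal illustration on a hypothetical toy corpus (the `rows`, `eval_urls`, and `stratified_sample` names are assumptions made for this example, not part of the notebook): each publisher contributes rows in proportion to its share, and the evaluation articles are then force-included without creating duplicates.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, key, sample_size, seed=42):
    """Sample each stratum in proportion to its share of the full dataset."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    rng = random.Random(seed)
    sample = []
    for members in groups.values():
        # int() truncates, so small strata can end up with zero rows
        n = int(sample_size / len(rows) * len(members))
        sample.extend(rng.sample(members, n))
    return sample


# Toy corpus (hypothetical): three publishers with a skewed distribution
rows = (
    [{"domain": "a", "url": f"a/{i}"} for i in range(70)]
    + [{"domain": "b", "url": f"b/{i}"} for i in range(25)]
    + [{"domain": "c", "url": f"c/{i}"} for i in range(5)]
)
sample = stratified_sample(rows, "domain", sample_size=10)
per_domain = Counter(row["domain"] for row in sample)

# Force-include the articles that the evaluation questions refer to
eval_urls = {"a/0", "c/3"}
eval_rows = [row for row in rows if row["url"] in eval_urls]
final = [row for row in sample if row["url"] not in eval_urls] + eval_rows
```

Because the per-stratum size is rounded down, the sample can come out slightly smaller than requested, and very small strata may only appear if an evaluation article happens to belong to them.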
eval_articles_df = articles_df[articles_df["url"].isin(human_eval_df["url"])]
eval_articles_df.head()
| | title | content | domain | url | article | lang | readability |
|---|---|---|---|---|---|---|---|
| 93950 | Agrivoltaics Goes Nuclear On California Prairie | ['A decommissioned nuclear power plant from the 1980s is repurposed for agrivoltaics and prairie... | cleantechnica | cleantechnica.com/2022/12/18/agrivoltaics-goes-nuclear-on-california-prairie | A decommissioned nuclear power plant from the 1980s is repurposed for agrivoltaics and prairie r... | en | 42.00 |
| 93986 | The Wait For Hydrogen Fuel Cell Electric Aircraft Just Got Shorter | ['The US firm ZeroAvia is one step closer to bringing its zero emission electric aircraft to mar... | cleantechnica | cleantechnica.com/2023/01/02/the-wait-for-hydrogen-fuel-cell-electric-aircraft-just-got-shorter/... | The US firm ZeroAvia is one step closer to bringing its zero emission electric aircraft to marke... | en | 50.46 |
| 43308 | Leclanché’ s new disruptive battery boosts energy density | ['Energy storage company Leclanché ( SW.LECN) has designed a new battery cell that uses less cob... | energyvoice | sgvoice.energyvoice.com/strategy/technology/23971/leclanches-new-disruptive-battery-boosts-energ... | Energy storage company Leclanché ( SW.LECN) has designed a new battery cell that uses less cobal... | en | 43.22 |
| 21630 | Quality Control System for Ocean Temperature In-Situ Profiles | ["By clicking `` Allow All '' you agree to the storing of cookies on your device to enhance site... | azocleantech | azocleantech.com/news.aspx?newsID=32873 | Over the last century, over 16 million ocean temperature profiles have been acquired. However, ... | en | 37.91 |
| 98795 | European Commission introduces Green Deal Industrial Plan – pv magazine International | ['The European Commission listed tax exemptions, flexible aid, and the promotion of local manufa... | pv-magazine | pv-magazine.com/2023/02/02/european-commission-introduces-green-deal-industrial-plan | The European Commission listed tax exemptions, flexible aid, and the promotion of local manufact... | en | 42.72 |
print(eval_articles_df["url"].unique().shape)
print(human_eval_df["url"].unique().shape)
(20,)
(21,)
def do_stratification(
    df: pd.DataFrame,
    column: str,
    sample_size: int,
    seed: int = 42
) -> pd.DataFrame:
    res_df = df.copy()
    # sample each group in proportion to its share of df; int() rounds down
    indx = (
        df.groupby(column, group_keys=False)[column]
        .apply(lambda x: x.sample(n=int(sample_size/len(df) * len(x)), random_state=seed))
        .index.to_list()
    )
    return res_df.loc[indx]
sample_df = do_stratification(articles_df, "domain", 1000, 69)
# drop sampled articles that are also in the evaluation set, then append the evaluation articles so every url appears only once
sample_df = sample_df[~sample_df["url"].isin(eval_articles_df["url"])]
sample_df = pd.concat([sample_df, eval_articles_df])
sample_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1004 entries, 17313 to 63679
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   title        1004 non-null   object
 1   content      1004 non-null   object
 2   domain       1004 non-null   object
 3   url          1004 non-null   object
 4   article      1004 non-null   object
 5   lang         1004 non-null   object
 6   readability  1004 non-null   float64
dtypes: float64(1), object(6)
memory usage: 62.8+ KB
domain_counts = sample_df['domain'].value_counts()
print(domain_counts)
domain
energy-xprt              207
pv-magazine              155
azocleantech             128
cleantechnica            110
pv-tech                   97
thinkgeoenergy            53
energyvoice               43
solarpowerportal.co       42
solarpowerworldonline     39
solarindustrymag          31
solarquarter              30
rechargenews              28
naturalgasintel           14
iea                        8
energyintel                8
greenprophet               6
greenairnews               2
ecofriend                  2
all-energy                 1
Name: count, dtype: int64
To make sure that subsampling has not changed the distribution of articles across domains, we visualize and compare both datasets in relative terms.
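Besides a visual check, the same comparison can be expressed as a single number. The sketch below is an illustration on made-up label lists (the `shares` and `max_share_gap` helpers and the toy data are assumptions for this example): it computes each label's percentage share in both datasets and reports the largest gap in percentage points.

```python
from collections import Counter


def shares(labels):
    """Return each label's share of the total, in percent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}


def max_share_gap(original, sample):
    """Largest absolute difference in percentage points across all labels."""
    orig, samp = shares(original), shares(sample)
    labels = orig.keys() | samp.keys()
    return max(abs(orig.get(k, 0.0) - samp.get(k, 0.0)) for k in labels)


# Toy data (hypothetical): 70/25/5 percent originally, 72/24/4 percent in the sample
original = ["a"] * 700 + ["b"] * 250 + ["c"] * 50
sample = ["a"] * 72 + ["b"] * 24 + ["c"] * 4
gap = max_share_gap(original, sample)
```

A small gap suggests that the subsample preserves the original proportions; what counts as small enough is a judgment call for the task at hand.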
original_domain_counts = articles_df["domain"].value_counts().to_frame()
original_domain_counts = original_domain_counts / original_domain_counts.sum() * 100
domain_counts_df = original_domain_counts.copy()
domain_counts_df["type"] = "Original"
sample_domain_counts = sample_df["domain"].value_counts().to_frame()
sample_domain_counts = sample_domain_counts / sample_domain_counts.sum() * 100
sample_domain_counts["type"] = "Sample"
domain_counts_df = pd.concat([domain_counts_df, sample_domain_counts])
sns.barplot(
    x=domain_counts_df.index,
    y=domain_counts_df["count"],
    hue=domain_counts_df["type"]
)
plt.title("Domain Distribution")
plt.xlabel("Domain")
plt.ylabel("Percentage")
plt.xticks(rotation=90)
plt.show()
Now all is prepared to start developing our RAG!
For Media QA Pairs Dataset¶
eval_articles_df_media = articles_df[articles_df["url"].isin(human_eval_df_media["url"])]
eval_articles_df_media.head()
| | title | content | domain | url | article | lang | readability |
|---|---|---|---|---|---|---|---|
| 63200 | NREL Project Investigates Wind Condition Impacts on Solar Power Structures | ['High wind loads increase structural design costs of concentrating solar power ( CSP) collector... | solarindustrymag | solarindustrymag.com/nrel-project-investigates-wind-condition-impacts-on-solar-power-structures#... | High wind loads increase structural design costs of concentrating solar power ( CSP) collector s... | en | 30.09 |
| 42863 | Patrick Harvie: Ukraine invasion doesn't justify North Sea production boost | ['Scottish Green party co-leader Patrick Harvie has said the war in Ukraine must not be used to ... | energyvoice | energyvoice.com/oilandgas/europe/395102/north-sea-patrick-harvie | Scottish Green party co-leader Patrick Harvie has said the war in Ukraine must not be used to ju... | en | 52.12 |
| 98356 | South Korea tests photovoltaics on railroad noise barriers – pv magazine International | ['Land-scarce South Korea is currently hosting a series of initiatives aimed at deploying solar ... | pv-magazine | pv-magazine.com/2022/04/21/south-korea-tests-photovoltaics-on-railroad-noise-barriers | Land-scarce South Korea is currently hosting a series of initiatives aimed at deploying solar on... | en | 52.09 |
| 98368 | Quebec publishes draft documents for 1.3 GW tender – pv magazine International | ['The Canadian provincial government’ s Green Economy Plan, launched in November 2020, envisages... | pv-magazine | pv-magazine.com/2022/04/27/quebec-publishes-draft-documents-for-1-3-gw-tender | The Canadian provincial government’ s Green Economy Plan, launched in November 2020, envisages a... | en | 40.69 |
| 63288 | New SEIA Nonprofit Serves to Advance Solar Industry Research, Policies | ['The Solar Energy Industries Association ( SEIA) is launching a 501 ( c) 3 nonprofit to acceler... | solarindustrymag | solarindustrymag.com/new-seia-nonprofit-serves-to-advance-solar-industry-research-policies | The Solar Energy Industries Association ( SEIA) is launching a 501 ( c) 3 nonprofit to accelerat... | en | 38.05 |
print(eval_articles_df_media["url"].unique().shape)
print(human_eval_df_media["url"].unique().shape)
(48,)
(48,)
We create common embeddings for the given QA evaluation dataset and our curated dataset.
sample_df_media = sample_df.copy()
# drop sampled articles that are also in the media evaluation set, then append the evaluation articles so every url appears only once
sample_df_media = sample_df_media[~sample_df_media["url"].isin(eval_articles_df_media["url"])]
sample_df_media = pd.concat([sample_df_media, eval_articles_df_media])
sample_df_media.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1051 entries, 17313 to 23380
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   title        1051 non-null   object
 1   content      1051 non-null   object
 2   domain       1051 non-null   object
 3   url          1051 non-null   object
 4   article      1051 non-null   object
 5   lang         1051 non-null   object
 6   readability  1051 non-null   float64
dtypes: float64(1), object(6)
memory usage: 65.7+ KB
domain_counts = sample_df_media['domain'].value_counts()
print(domain_counts)
domain
energy-xprt              207
pv-magazine              164
azocleantech             135
cleantechnica            125
pv-tech                   97
thinkgeoenergy            53
energyvoice               52
solarpowerportal.co       42
solarpowerworldonline     39
solarindustrymag          38
solarquarter              30
rechargenews              28
naturalgasintel           14
iea                        8
energyintel                8
greenprophet               6
greenairnews               2
ecofriend                  2
all-energy                 1
Name: count, dtype: int64
original_domain_counts = articles_df["domain"].value_counts().to_frame()
original_domain_counts = original_domain_counts / original_domain_counts.sum() * 100
domain_counts_df = original_domain_counts.copy()
domain_counts_df["type"] = "Original"
sample_domain_counts = sample_df_media["domain"].value_counts().to_frame()
sample_domain_counts = sample_domain_counts / sample_domain_counts.sum() * 100
sample_domain_counts["type"] = "Sample"
domain_counts_df = pd.concat([domain_counts_df, sample_domain_counts])
sns.barplot(
    x=domain_counts_df.index,
    y=domain_counts_df["count"],
    hue=domain_counts_df["type"]
)
plt.title("Domain Distribution")
plt.xlabel("Domain")
plt.ylabel("Percentage")
plt.xticks(rotation=90)
plt.show()
To use a single set of embeddings for both the seed QA dataset and our generated QA dataset, we added the articles referenced by both of them to the subsampled dataset. That is why the distribution of the subsampled dataset deviates slightly from the original.
For Generated Patent QA Dataset¶
human_eval_df_patent.head()
| | id | relevant_section | question | answer | category | groundedness_score | groundedness_eval | relevance_score | relevance_eval | standalone_score | standalone_eval | title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Distributed photovoltaic energy storage refrigeration house systemThe utility model discloses a ... | How does the system reduce the cost of cold storage? | The system reduces the cost of cold storage by converting solar energy into electric energy, whi... | Sustainability & Technological Innovation Questions | 5.0 | The question can be answered based on the given context, and the answer is clear and unambiguous. | 4.0 | The question is concise and to the point, directly asking about a specific aspect of how the Hug... | 5.0 | The question appears to be related to a general concept of cost reduction in the context of data... | Distributed photovoltaic energy storage refrigeration house system |
| 1 | 1 | Water path manifold structure of hydrogen energy automobile electric pileThe utility model discl... | What is good about the utility model? | The utility model has a simple structure that is easy to assemble and disassemble. | Analytical & Explanatory Questions | 5.0 | The question "What is good about the utility model?" is somewhat ambiguous without further clari... | 3.0 | The question is very short and to the point, but it lacks context and detail about the specific ... | 5.0 | The question seems to be asking about a general property or characteristic of a "utility model",... | Water path manifold structure of hydrogen energy automobile electric pile |
| 2 | 2 | Active power control method of water-fire-wind-solar energy storage multi-energy complementary i... | What is the purpose of using power supply with better regulation performance to compensate for p... | The power supply with better regulation performance is used to carry out compensation regulation... | Government & Corporate Initiatives | 5.0 | The question is clearly answerable by understanding the purpose of using power supply with bette... | 3.0 | The question is directly related to power supply regulation and its impact on the performance of... | 5.0 | The question assumes knowledge of power supplies in general, specifically their regulation perfo... | Active power control method of water-fire-wind-solar energy storage multi-energy complementary i... |
| 3 | 3 | Water conservancy and hydropower engineering construction tunnel internal flow guiding and drain... | What is the water conservancy and hydropower engineering construction hole inner diversion drain... | The utility model discloses a water conservancy and hydropower engineering construction hole inn... | Sustainability & Technological Innovation Questions | 5.0 | The question is clearly answerable with the given context, as it describes a specific inner dive... | 3.0 | The question seems to be about a specific technical term, which may be useful for machine learni... | 5.0 | The question contains technical terms and a specific reference to a concept that appears to be w... | Water conservancy and hydropower engineering construction tunnel internal flow guiding and drain... |
| 4 | 4 | Medium-and-long-term electric power quantity balancing method for electric power system containi... | What is the main consideration for the balancing method, in addition to safety and economy? | The seasonal characteristics of renewable energy sources in time and the coordination problem of... | Analytical & Explanatory Questions | 4.0 | The context provides a detailed description of a method for balancing electric quantity in a pow... | 4.0 | The question is asking about a specific aspect of the balancing method, which is a common techni... | 5.0 | The question does not provide a specific context, and the balancing method is a general concept ... | Medium-and-long-term electric power quantity balancing method for electric power system containi... |
In the patent dataset, some patents share a title but have different abstracts, and vice versa. To find the rows that belong to the evaluation set, we therefore check both the title and the abstract; the filter below keeps a row if either field matches.
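The effect of shared titles and abstracts can be seen on a small example. The sketch below uses made-up patents and a hypothetical `norm` helper: matching on either field returns every row that shares a title or an abstract with an evaluation item, whereas matching on the (title, abstract) pair returns only the exact row. The pair-based variant is shown for comparison only and is not what the notebook does.

```python
def norm(text):
    """Normalize a string for case-insensitive comparison."""
    return text.strip().lower()


# Toy data (hypothetical): two patents share a title, two share an abstract
patents = [
    {"title": "Solar heat storage system", "abstract": "Uses molten salt."},
    {"title": "Solar heat storage system", "abstract": "Uses heated sand."},
    {"title": "Wind turbine blade", "abstract": "Uses heated sand."},
]
eval_items = [{"title": "solar heat storage system", "abstract": "uses heated sand."}]

eval_titles = {norm(e["title"]) for e in eval_items}
eval_abstracts = {norm(e["abstract"]) for e in eval_items}
eval_pairs = {(norm(e["title"]), norm(e["abstract"])) for e in eval_items}

# Keep a row if the title OR the abstract matches (broad)
either = [
    p for p in patents
    if norm(p["title"]) in eval_titles or norm(p["abstract"]) in eval_abstracts
]

# Keep a row only if title AND abstract match together (strict)
both = [p for p in patents if (norm(p["title"]), norm(p["abstract"])) in eval_pairs]
```

The broad filter errs on the side of including too many rows, which is harmless for building a knowledge base; the strict filter is the one to reach for when exactly one row per evaluation item is required.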
eval_abstracts_df_patent = patents_df[
    patents_df["title"].str.lower().isin(human_eval_df_patent["title"].str.lower()) |
    patents_df["abstract"].str.lower().isin(human_eval_df_patent["relevant_section"].str.lower())
]
eval_abstracts_df_patent.head()
| | publication_number | application_number | country_code | title | abstract | title_lang | abstract_lang | text | cleaned_text | topic | readability |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9045 | CN-114558462-B | CN-202210285989-A | CN | Preparation method and application method of photothermal conversion fiber membrane | The invention belongs to the technical field of sea water desalination, and particularly relates... | en | en | Preparation method and application method of photothermal conversion fiber membrane. The inventi... | preparation method application method photothermal conversion fiber membrane invention belongs t... | 6 | -20.90 |
| 12854 | CN-218580607-U | CN-202222107350-U | CN | Water conservancy and hydropower engineering construction tunnel internal flow guiding and drain... | The utility model relates to a hydraulic and hydroelectric engineering field, concretely relates... | en | en | Water conservancy and hydropower engineering construction tunnel internal flow guiding and drain... | water conservancy hydropower engineering construction tunnel internal flow guiding drainage stru... | 0 | -33.08 |
| 22518 | CN-114777340-A | CN-202210508526-A | CN | Concentrating solar energy seasonal sand high-temperature heat storage heating and hot water system | The invention relates to a light-concentrating solar cross-season sand high-temperature heat sto... | en | en | Concentrating solar energy seasonal sand high-temperature heat storage heating and hot water sys... | concentrating solar energy seasonal sand high-temperature heat storage heating hot water system ... | 1 | -26.01 |
| 32032 | CN-113775355-B | CN-202111218769-A | CN | Rock mass stable type supporting device for rock burst prevention in tunnel excavation and const... | The invention discloses a rock mass stable type supporting device for rock burst prevention in t... | en | en | Rock mass stable type supporting device for rock burst prevention in tunnel excavation and const... | rock mass stable type supporting device rock burst prevention tunnel excavation construction met... | 0 | 1.61 |
| 32194 | CN-113683242-B | CN-202110787033-A | CN | Utilize solar energy to realize source separation urine and excrement and urine resourceization&... | The invention provides a treatment system for recycling urine and excrement by utilizing solar e... | en | en | Utilize solar energy to realize source separation urine and excrement and urine resourceization&... | utilize solar energy realize source separation urine excrement urine resourceization 39 processi... | 6 | -74.02 |
sample_df_patent = do_stratification(patents_df, "topic", 1000, 69)
# drop sampled patents whose title or abstract also appears in the evaluation set, then append the evaluation patents so each appears only once
sample_df_patent = sample_df_patent[~sample_df_patent["title"].isin(eval_abstracts_df_patent["title"]) & ~sample_df_patent["abstract"].isin(eval_abstracts_df_patent["abstract"])]
sample_df_patent = pd.concat([sample_df_patent, eval_abstracts_df_patent])
sample_df_patent.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1009 entries, 67506 to 67707
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   publication_number  1009 non-null   object
 1   application_number  1009 non-null   object
 2   country_code        1009 non-null   object
 3   title               1009 non-null   object
 4   abstract            1009 non-null   object
 5   title_lang          1009 non-null   object
 6   abstract_lang       1009 non-null   object
 7   text                1009 non-null   object
 8   cleaned_text        1009 non-null   object
 9   topic               1009 non-null   int64
 10  readability         1009 non-null   float64
dtypes: float64(1), int64(1), object(9)
memory usage: 94.6+ KB
domain_counts = sample_df_patent['topic'].value_counts()
print(domain_counts)
topic
0    173
1    128
6    113
2    111
8    106
5     96
3     94
7     75
4     67
9     46
Name: count, dtype: int64
original_domain_counts = patents_df["topic"].value_counts().to_frame()
original_domain_counts = original_domain_counts / original_domain_counts.sum() * 100
domain_counts_df = original_domain_counts.copy()
domain_counts_df["type"] = "Original"
sample_domain_counts = sample_df_patent["topic"].value_counts().to_frame()
sample_domain_counts = sample_domain_counts / sample_domain_counts.sum() * 100
sample_domain_counts["type"] = "Sample"
domain_counts_df = pd.concat([domain_counts_df, sample_domain_counts])
sns.barplot(
    x=domain_counts_df.index,
    y=domain_counts_df["count"],
    hue=domain_counts_df["type"]
)
plt.title("Topic Distribution")
plt.xlabel("Topic")
plt.ylabel("Percentage")
plt.xticks(rotation=90)
plt.show()
Here the distribution of the subsampled dataset stays the same as in the original.
Chunking¶
Chunking is a crucial step in the RAG pipeline. It involves breaking down the articles into smaller, more manageable pieces.

There are mainly two reasons for this:
- Generation: The LLM has a limit on the number of tokens it can process at once. By chunking the articles, we ensure that the retrieved context fits within this limit. Chunks also keep the LLM from being distracted by irrelevant parts of an article: if you were asked a question about a book, it would be easier to answer if you were given only the relevant chapter.
- Retrieval: Like the LLM, the embedding model used in the retrieval step has a limit on the number of tokens it can process at once. Chunking ensures that no part of an article is cut off when it is embedded. It also improves retrieval, because fine-grained chunks can be matched more closely to the user query than broad, general ones.
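To make the notion of a chunk concrete, here is a minimal sketch of the simplest possible strategy: fixed-size character windows with an overlap. The `chunk_text` helper is hypothetical and only illustrates how chunk size and overlap interact; it is not one of the strategies used later in this notebook.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # stop once the remaining tail is already covered by the previous window
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
```

Each window shares its first character with the end of the previous one, so a sentence that straddles a boundary is less likely to lose its context entirely.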
Let's start by getting a feel for how much text common chunk sizes hold, measured in characters.
def get_lorem_text(num_chars: int) -> str:
    expected_avg_word_len = 3 # on the lower side to be safe
    text = lorem.words(num_chars // expected_avg_word_len)
    return text[:num_chars]
print(wrap_text(get_lorem_text(256)))
fuga aperiam ipsa ipsum corrupti assumenda accusamus architecto quisquam eum dolorum maiores voluptatum consequatur omnis quibusdam hic temporibus ducimus quam veritatis delectus dolore nostrum quis rerum sint reprehenderit magni sit veniam minima officia
print(wrap_text(get_lorem_text(512)))
similique commodi quam sint inventore aliquam earum cumque obcaecati praesentium excepturi eius totam ab iure eveniet dolores hic possimus vero fugiat velit neque esse quod at provident nemo illum quisquam laborum tenetur itaque nihil tempore eligendi ut et voluptate aut fugit recusandae perferendis impedit architecto eum soluta assumenda sit reprehenderit quis voluptatibus amet aspernatur delectus dicta ratione cupiditate consequatur expedita quaerat doloribus rerum sed autem quibusdam incidunt repellendus
print(wrap_text(get_lorem_text(1024)))
rem nostrum autem laborum modi voluptates veniam voluptatum ea nesciunt iste officia alias iusto repellendus nisi sunt veritatis voluptatem error optio sed dolorum eaque laboriosam excepturi dicta aliquam at molestias adipisci a dolore iure eos doloribus maiores quidem ab consectetur reiciendis ullam nihil vel quo saepe facere omnis harum voluptas soluta amet suscipit itaque deleniti incidunt distinctio hic nam quae tempore obcaecati asperiores dolorem numquam deserunt nulla quod magni quos corrupti enim assumenda libero corporis nobis natus labore repudiandae ipsa illo quisquam eius minima aperiam accusamus vitae perferendis neque est molestiae culpa expedita sit ipsum similique delectus reprehenderit quaerat odit ut tempora aliquid ratione dolor perspiciatis eveniet possimus ex inventore praesentium dolores ducimus necessitatibus fugiat blanditiis placeat consequatur recusandae voluptate quibusdam rerum quia explicabo porro nemo illum accusantium officiis ad dignissimos magnam consequuntur provident asperna
print(wrap_text(get_lorem_text(2048)))
reprehenderit debitis perspiciatis fuga optio eligendi temporibus numquam facilis omnis magnam illo exercitationem non repudiandae quod cupiditate neque laborum sunt consequuntur voluptates praesentium accusantium expedita animi eos iste harum illum et soluta quas magni deleniti labore nulla error eveniet laudantium iure aliquid ducimus iusto commodi sint est totam tempora eaque amet ea dolorum blanditiis aut alias necessitatibus aspernatur quisquam vel similique quia corrupti a sit sed eum ab fugit in nostrum nisi autem ut beatae consectetur obcaecati fugiat dolores id ullam dignissimos esse quos reiciendis facere explicabo adipisci pariatur nam quis molestias voluptatum deserunt ipsam placeat porro quaerat officia repellat tempore delectus ex doloremque incidunt velit minima saepe consequatur libero officiis tenetur perferendis nesciunt atque quo quibusdam ratione minus rem dolore maxime quam odit asperiores voluptatibus quasi dolor veritatis voluptas unde hic architecto voluptate ipsa excepturi nemo qui inventore ipsum modi veniam mollitia cumque aliquam quae maiores sequi sapiente earum aperiam vero dolorem odio quidem assumenda molestiae eius distinctio culpa doloribus laboriosam voluptatem impedit corporis at enim provident nihil itaque cum rerum repellendus recusandae possimus vitae ad accusamus nobis natus suscipit dicta earum sequi assumenda libero impedit quibusdam consectetur quod modi neque distinctio ipsa corrupti veritatis odit et in voluptates magni inventore minima labore perspiciatis dolorem quidem molestiae eligendi odio maiores officia nostrum delectus pariatur ab iste sapiente iure error doloremque voluptatibus voluptatum quo est repellendus vitae nihil expedita nesciunt mollitia cupiditate aperiam doloribus sunt amet ipsum exercitationem blanditiis autem vel accusantium facere illum cum veniam esse animi atque ipsam at vero adipisci laboriosam ratione placeat facilis natus voluptatem quae quaerat ut non sint accusamus temporibus cumque 
necessitatibus aliquid quos debitis qui hic nam porro eaq
Creating the Chunks¶
In this notebook we will be using two different chunking strategies:
- Recursive Chunking: This strategy involves recursively splitting the article into smaller chunks based on the article structure such as paragraphs and sentences until the chunk size is less than or equal to the maximum chunk size.
- Semantic Chunking: This strategy involves splitting the article into chunks based on semantic boundaries. This strategy finds boundaries between sentences that are semantically different and splits the article at these boundaries to create chunks. To do this we will need to use an embedding model to calculate the similarity between sentences. These embedding models will then also be used in the retrieval step to find the most relevant chunks.
To see how different texts get chunked with different strategies and chunk sizes check out the Chunking Visualizer.
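To make the recursive strategy more concrete, here is a minimal sketch in plain Python. It is not the LangChain implementation: the function name `recursive_split` and the demo text are hypothetical, chunk overlap is omitted, and separators are dropped at chunk boundaries. It only illustrates the idea of trying the coarsest separator first and falling back to finer ones when a piece is still too long.

```python
from typing import List


def recursive_split(text: str, max_len: int, separators: List[str]) -> List[str]:
    """Split on the coarsest separator first, recursing with finer ones (illustrative only)."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard cuts of max_len characters.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # Separator does not occur in the text: try the next finer one.
        return recursive_split(text, max_len, finer)
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            # Greedily merge neighbouring pieces while they still fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_len:
            current = piece
        else:
            # A single piece is too long: split it with finer separators.
            chunks.extend(recursive_split(piece, max_len, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


demo_chunks = recursive_split(
    "First sentence here. Second one follows. A third, somewhat longer sentence ends it.",
    max_len=40,
    separators=["\n\n", "\n", ". ", " "],
)
for chunk in demo_chunks:
    print(len(chunk), repr(chunk))
```

Since the demo text contains no newlines, the sketch falls through to the sentence separator and then to single spaces, which mirrors the situation we check for in our own data below.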
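def get_recursive_splitter(chunk_size: int, chunk_overlap: int) -> TextSplitter:
    return RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        # NOTE: the sentence pattern is only interpreted as a regex lookbehind when
        # is_separator_regex=True; with the default (False) it is matched literally.
        separators=["\n\n", "\n", r"(?<=\. )", " ", ""],
        length_function=len,
    )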
# The recursive splitter relies mainly on newlines. Are there any? No, so it falls back to the finer separators.
sample_df["article"].map(lambda x: x.count("\n")).sum()
0
sample_df_media["article"].map(lambda x: x.count(".")).sum()
51806
sample_df_patent["abstract"].map(lambda x: x.count(".")).sum()
2495
Let us set the device for efficient use of available resources.
# if we can make use of any device that is better than the CPU, we will use it
device = "cpu"
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
model_kwargs = {'device': device, "trust_remote_code": True}
model_kwargs
{'device': 'cuda', 'trust_remote_code': True}
We select three embedding models from HuggingFace to represent our text fragments numerically in a vector space. Note that the key "mini" refers to the all-mpnet-base-v2 model.
embedding_models = {
"mini": HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2", model_kwargs=model_kwargs),
"bge-m3": HuggingFaceEmbeddings(model_name="BAAI/bge-m3", model_kwargs=model_kwargs),
"gte": HuggingFaceEmbeddings(model_name="Alibaba-NLP/gte-base-en-v1.5", model_kwargs=model_kwargs),
}
--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
We also define the chunking strategies to be used. Recursive splitting is characterized by the chunk length and the overlap between adjacent chunks. For semantic chunking, sentences are embedded as dense vectors and merged as long as the cosine distance between two consecutive sentences does not exceed a percentile-based threshold.
recursive_256_splitter = get_recursive_splitter(256, 64)
recursive_1024_splitter = get_recursive_splitter(1024, 128)
semantic_splitter = SemanticChunker(
embedding_models["gte"], breakpoint_threshold_type="percentile"
)
splitters = {
"recursive_256": recursive_256_splitter,
"recursive_1024": recursive_1024_splitter,
"semantic": semantic_splitter
}
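The percentile-based breakpoint rule described above can be sketched without any library. This is an illustration under stated assumptions, not the `SemanticChunker` code: the helper names, the toy 2D "embeddings" and the choice of the 95th percentile are all hypothetical. Consecutive sentences stay in the same chunk unless the cosine distance between them exceeds the chosen percentile of all consecutive distances.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_groups(
    sentences: List[str], vectors: List[Sequence[float]], q: float = 95.0
) -> List[List[str]]:
    """Group consecutive sentences, breaking where the distance exceeds the q-th percentile."""
    if len(sentences) < 2:
        return [list(sentences)]
    distances = [
        cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, q)
    groups, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:
            # Large semantic jump: close the current chunk and start a new one.
            groups.append(current)
            current = [sentence]
        else:
            current.append(sentence)
    groups.append(current)
    return groups


# Toy vectors: "a" and "b" point one way, "c" and "d" another.
toy_sentences = ["a", "b", "c", "d"]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
groups = semantic_groups(toy_sentences, toy_vectors)
print(groups)
```

Because the threshold is relative to the distances within each text, the resulting chunk sizes are not bounded by a fixed character limit, which is worth keeping in mind when comparing the chunk length distributions later on.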
from typing import Dict, List, Tuple

def chunk_documents(df: pd.DataFrame, text_splitter) -> List[Document]:
    """Split each row's text into chunks and attach the row metadata plus a running id."""
    chunks = []
    chunk_id = 0  # running id across all chunks (avoids shadowing the builtin `id`)
    for _, row in tqdm(df.iterrows(), total=len(df)):
        if 'article' in df.columns:
            # Media dataset: article text with title, url and domain
            content = row['article']
            title = row.get('title', '')
            full_text = f"{title}: {content}"
            metadata = {
                'title': title,
                'url': row.get('url', ''),
                'domain': row.get('domain', ''),
            }
        elif 'abstract' in df.columns:
            # Patent dataset: abstract text with title and topic
            content = row['abstract']
            title = row.get('title', '')
            full_text = f"{title}: {content}"
            metadata = {
                'title': title,
                'topic': row.get('topic', ''),
            }
        else:
            continue  # Skip rows that don't have expected fields
        char_chunks = text_splitter.split_text(full_text)
        for chunk in char_chunks:
            chunk_id += 1
            chunk_metadata = metadata.copy()
            chunk_metadata['id'] = chunk_id
            chunks.append(Document(
                page_content=chunk,
                metadata=chunk_metadata,
            ))
    return chunks
chunks_folder = silver_folder / "chunks"
if not chunks_folder.exists():
    chunks_folder.mkdir()
The following function will load existing chunks, prepared for our tutorial to speed up the preparation process.
def get_or_create_chunks(df: pd.DataFrame, text_splitter: TextSplitter, splitter_name: str) -> List[Document]:
    chunks_file = chunks_folder / f"{splitter_name}_chunks.json"
    if chunks_file.exists():
        # Reuse the checkpoint instead of chunking again
        with open(chunks_file, "r") as file:
            chunks = [Document(**chunk) for chunk in json.load(file)]
        print(f"Loaded {len(chunks)} chunks from {chunks_file}")
    else:
        chunks = chunk_documents(df, text_splitter)
        with open(chunks_file, "w") as file:
            json.dump([doc.dict() for doc in chunks], file, indent=4)
        print(f"Saved {len(chunks)} chunks to {chunks_file}")
    return chunks
We now create the chunks for the media and patent datasets.
chunks = {}
for splitter_name, splitter in splitters.items():
    chunks[splitter_name + "_media"] = get_or_create_chunks(sample_df_media, splitter, splitter_name + "_media")
    chunks[splitter_name + "_patent"] = get_or_create_chunks(sample_df_patent, splitter, splitter_name + "_patent")
Loaded 27820 chunks from data/data_new/silver/chunks/recursive_256_media_chunks.json
Loaded 6777 chunks from data/data_new/silver/chunks/recursive_256_patent_chunks.json
Loaded 6288 chunks from data/data_new/silver/chunks/recursive_1024_media_chunks.json
Loaded 1825 chunks from data/data_new/silver/chunks/recursive_1024_patent_chunks.json
Loaded 3457 chunks from data/data_new/silver/chunks/semantic_media_chunks.json
Loaded 1313 chunks from data/data_new/silver/chunks/semantic_patent_chunks.json
Now that we have created and saved the chunks, we can analyze them. The counts above already show far fewer semantic chunks than recursive chunks for the same corpus, so the semantic chunks are larger on average.
Analyzing the Chunks¶
Let's start by looking at the first chunk of the first article to get a feeling for what the chunks look like under each chunking strategy. After that we will look at the distribution of the chunk sizes and the number of chunks per article.
for splitter_name, splitter_chunks in chunks.items():
    print(f"{splitter_name} chunks:")
    print(wrap_text(splitter_chunks[0].page_content, char_per_line=150))
    print()
recursive_256_media chunks: Charging Ahead: The UK’ s Electric Vehicle Revolution: Change is sweeping the highways of the United Kingdom. Being responsible for 88% of passenger miles and 79% of freight traffic, England's highways are crucial. Aside from that, despite making up just recursive_256_patent chunks: Light condensing device for photovoltaic module: The utility model discloses a light condensing device for a photovoltaic module, which comprises a fixing frame, wherein the top end of the fixing frame is rotatably connected with a solar panel, two sides recursive_1024_media chunks: Charging Ahead: The UK’ s Electric Vehicle Revolution: Change is sweeping the highways of the United Kingdom. Being responsible for 88% of passenger miles and 79% of freight traffic, England's highways are crucial. Aside from that, despite making up just 2% of roads, National Highways ' strategic road network ( SRN) handles one-third of passenger miles and two-thirds of freight miles in England. Connectivity is crucial for investment, community empowerment, and efficient domestic and international supply chains in the SRN. As they glide past their diesel and gasoline-powered competitors, sleek electric cars ( EVs) are bringing with them the promise of a more environmentally friendly future. An important strategic change driving the UK's ambitious decarbonisation targets is the recent uptick in the popularity of electric vehicles. In an effort to speed up the transition to electric vehicles, the UK government has enacted a number of regulations. 
According to a May 2019 report by the Committee on Climate Change recursive_1024_patent chunks: Light condensing device for photovoltaic module: The utility model discloses a light condensing device for a photovoltaic module, which comprises a fixing frame, wherein the top end of the fixing frame is rotatably connected with a solar panel, two sides of the solar panel are rotatably connected with rotating shafts, light condensing plates are fixedly arranged on the surfaces of the two rotating shafts, fixing rings are fixedly arranged on the two sides of the light condensing plates, steel wire ropes are tied on the surfaces of the two fixing rings, winding wheels are sleeved and connected on the two ends of the rotating shafts, the two winding wheels are wound and connected with corresponding steel wire ropes, and a fixing box is fixedly arranged on the side of the solar panel. According to the solar energy concentrating device, the driving gear is driven to rotate through the rotation of the servo motor, the driving gear is meshed with the driven gear, so that the rotating shaft is driven to rotate, the semantic_media chunks: Charging Ahead: The UK’ s Electric Vehicle Revolution: Change is sweeping the highways of the United Kingdom. Being responsible for 88% of passenger miles and 79% of freight traffic, England's highways are crucial. Aside from that, despite making up just 2% of roads, National Highways ' strategic road network ( SRN) handles one-third of passenger miles and two-thirds of freight miles in England. Connectivity is crucial for investment, community empowerment, and efficient domestic and international supply chains in the SRN. As they glide past their diesel and gasoline-powered competitors, sleek electric cars ( EVs) are bringing with them the promise of a more environmentally friendly future. An important strategic change driving the UK's ambitious decarbonisation targets is the recent uptick in the popularity of electric vehicles. 
In an effort to speed up the transition to electric vehicles, the UK government has enacted a number of regulations. According to a May 2019 report by the Committee on Climate Change ( CCC), in order to reach the net zero goal by 2035—or perhaps sooner—all new cars need to be powered by electricity. Aside from that, the government also presented its zero emission vehicle ( ZEV) mandate, which would advance the country’ s regulatory framework for the EV transition. Because of this, by 2030, 80% of new vehicles and 70% of new vans produced in Great Britain will have zero emissions, and by 2035, that number will rise to 100%. As of the end of sales in 2035, the UK will be on par with other big global economies, including Canada, France, Germany, and Sweden. By investing £5 billion in alternative alternatives and ending the sale of internal combustion engine cars by 2030–2035, the Transport Decarbonisation Plan, which was launched in 2021, also aims to reduce carbon emissions from transport. Along these lines, the National Highways Net Zero Plan aims to achieve zero emissions from operations by 2030 and from all maintenance and building by 2040. In keeping with larger sustainability objectives in transport, the strategy calls for a leadership position in HGV trials, investment in infrastructure, and assistance for drivers making the switch to zero-emission cars. Dcarbonise is Scotland's only exhibition and conference focused on reducing carbon emissions from the built environment and transportation systems. Together, let's redefine the possibilities for a sustainable future. 
semantic_patent chunks: Light condensing device for photovoltaic module: The utility model discloses a light condensing device for a photovoltaic module, which comprises a fixing frame, wherein the top end of the fixing frame is rotatably connected with a solar panel, two sides of the solar panel are rotatably connected with rotating shafts, light condensing plates are fixedly arranged on the surfaces of the two rotating shafts, fixing rings are fixedly arranged on the two sides of the light condensing plates, steel wire ropes are tied on the surfaces of the two fixing rings, winding wheels are sleeved and connected on the two ends of the rotating shafts, the two winding wheels are wound and connected with corresponding steel wire ropes, and a fixing box is fixedly arranged on the side of the solar panel. According to the solar energy concentrating device, the driving gear is driven to rotate through the rotation of the servo motor, the driving gear is meshed with the driven gear, so that the rotating shaft is driven to rotate, the rotating shaft drives the winding wheel to rotate and the light collecting plate to rotate, sunlight is focused on the surface of the solar panel through the angle adjustment of the light collecting plate, and the utilization rate of the solar panel for focusing is improved.
def plot_chunk_lengths(chunks: List[Document], title: str):
    # Compute the chunk lengths once and reuse them for the histogram and the statistics
    lengths = [len(chunk.page_content) for chunk in chunks]
    sns.histplot(lengths, kde=True)
    plt.title(title)
    plt.xlabel("Chunk length (characters)")
    plt.ylabel("Number of chunks")
    median_chunk_len = np.median(lengths)
    mean_chunk_len = np.mean(lengths)
    plt.axvline(median_chunk_len, color='r', linestyle='--', label=f"Median chunk length: {median_chunk_len:.2f}")
    plt.axvline(mean_chunk_len, color='g', linestyle='--', label=f"Mean chunk length: {mean_chunk_len:.2f}")
    plt.legend()
    plt.show()
plot_chunk_lengths(chunks["recursive_256_media"], "Chunk lengths for recursive 256 splitter (media)")
plot_chunk_lengths(chunks["recursive_256_patent"], "Chunk lengths for recursive 256 splitter (patent)")
plot_chunk_lengths(chunks["recursive_1024_media"], "Chunk lengths for recursive 1024 splitter (media)")
In the patent dataset, many chunks fall well short of the 1024-character maximum, presumably because many abstracts are themselves shorter than the limit.
plot_chunk_lengths(chunks["recursive_1024_patent"], "Chunk lengths for recursive 1024 splitter (patent)")
plot_chunk_lengths(chunks["semantic_media"], "Chunk lengths for semantic splitter (media)")
The semantic chunks differ between the two datasets: the patent dataset yields far fewer semantic chunks, and the mean chunk length is 1524 characters for media but only 966 for patents.
plot_chunk_lengths(chunks["semantic_patent"], "Chunk lengths for semantic splitter (patent)")
# Count the chunks per article title; `splitter_chunks` avoids rebinding the outer `chunks` dict
chunks_per_article = {splitter_name: Counter(chunk.metadata["title"] for chunk in splitter_chunks) for splitter_name, splitter_chunks in chunks.items()}
counts = {splitter_name: list(chunk_counts.values()) for splitter_name, chunk_counts in chunks_per_article.items()}
sns.histplot(counts, kde=True)
plt.title("Number of chunks per article")
plt.xlabel("Number of chunks")
plt.ylabel("Number of articles")
plt.legend(chunks_per_article.keys())
plt.show()
From this analysis we can see that, for the media dataset, the recursive chunks are all close to the defined maximum size, whereas for the patent dataset many are considerably shorter. The semantic chunks, on the other hand, vary in size for both datasets, because the semantic chunking strategy splits at semantic boundaries rather than at a fixed length.
We can also see that, although the semantic chunks are larger, the distribution of the number of chunks per article is much wider for the recursive chunks. This is because the recursive chunks are all roughly the same size, so their number grows with the length of the article, while the semantic strategy produces many smaller chunks and a few larger ones.
Generating Embeddings¶
Now that we have clean chunks, the next step involves generating embeddings for our article chunks. These embeddings will serve as a crucial component for efficient retrieval within the RAG pipeline. For our vector store we'll utilize ChromaDB, a powerful tool for indexing and searching high-dimensional data. To integrate our chosen embedding models with ChromaDB, we'll define a custom wrapper class. This wrapper class will act as an intermediary, ensuring seamless communication between the models and the ChromaDB indexing system.
class CustomChromadbEmbeddingFunction(EmbeddingFunction):
    """Adapter that lets ChromaDB call a LangChain embedding model."""

    def __init__(self, model) -> None:
        super().__init__()
        self.model = model

    def _embed(self, l):
        # Embed each text individually with the wrapped model
        return [self.model.embed_query(x) for x in l]

    def embed_query(self, query):
        return self._embed([query])

    def __call__(self, input: Documents) -> Embeddings:
        embeddings = self._embed(input)
        return embeddings
We compared all three embedding models and settled on "bge-m3" for our pipeline. To save money and processing time, we use only this model for the evaluation and comparison of the media and patent datasets.
chroma_embedding_functions = {
# "mini": CustomChromadbEmbeddingFunction(embedding_models["mini"]),
"bge-m3": CustomChromadbEmbeddingFunction(embedding_models["bge-m3"]),
# "gte": CustomChromadbEmbeddingFunction(embedding_models["gte"]),
}
for name, embedding_function in chroma_embedding_functions.items():
    sample = embedding_function(["Hello, world!"])[0][:5]
    print(f"{name} embedding sample: {sample}")
bge-m3 embedding sample: [-0.016155613586306572, 0.026993418112397194, -0.04258323833346367, 0.013542186468839645, -0.019354626536369324]
Generating embeddings can be a computationally intensive process. To optimize efficiency and avoid redundant computations, we'll leverage checkpointing. This technique involves storing the generated embeddings along with their corresponding article chunks. We'll define a simple class to encapsulate this data, facilitating efficient retrieval and reducing the need for recalculating embeddings unless absolutely necessary.
embeddings_folder = silver_folder / "embeddings"
if not embeddings_folder.exists():
    embeddings_folder.mkdir()

class DocumentEmbedding():
    """Pairs a chunk with its embedding so both can be checkpointed together."""

    def __init__(self, document: Document, text_embedding: List[float]) -> None:
        self.document = document
        self.text_embedding = text_embedding

    def to_dict(self) -> Dict:
        return {
            "document": self.document.dict(),
            "text_embedding": self.text_embedding
        }

    @classmethod
    def from_dict(cls, d: Dict) -> "DocumentEmbedding":
        return cls(
            document=Document(**d["document"]),
            text_embedding=d["text_embedding"]
        )
def get_or_create_embeddings(
    embedding_function: EmbeddingFunction,
    chunks: List[Document],
    embedding_name: str,
) -> List[DocumentEmbedding]:
    embeddings_file = embeddings_folder / f"{embedding_name}_embeddings.json"
    if embeddings_file.exists():
        # Reuse the checkpoint instead of recomputing the embeddings
        with open(embeddings_file, "r") as file:
            embeddings = [DocumentEmbedding.from_dict(embedding) for embedding in json.load(file)]
        print(f"Loaded {len(embeddings)} embeddings from {embeddings_file}")
    else:
        embeddings = []
        for chunk in tqdm(chunks):
            text_embedding = embedding_function([chunk.page_content])[0]
            embedding = DocumentEmbedding(
                document=chunk,
                text_embedding=text_embedding
            )
            embeddings.append(embedding)
        with open(embeddings_file, "w") as file:
            json.dump([embedding.to_dict() for embedding in embeddings], file, indent=4)
        print(f"Saved {len(embeddings)} embeddings to {embeddings_file}")
    return embeddings
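The load-or-compute pattern used for both chunks and embeddings can be reduced to a small generic helper. The following is a minimal sketch, assuming the result is JSON-serializable; `load_or_compute` and `fake_embed` are hypothetical names used only for illustration and are not part of the notebook's code.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def load_or_compute(path: Path, compute: Callable[[], List]) -> List:
    """Return the cached result if the checkpoint file exists, otherwise compute and save it."""
    if path.exists():
        with open(path, "r") as file:
            return json.load(file)
    result = compute()
    with open(path, "w") as file:
        json.dump(result, file)
    return result


calls = []


def fake_embed() -> List[List[float]]:
    # Stand-in for an expensive embedding step; records how often it runs.
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "demo_embeddings.json"
    first = load_or_compute(checkpoint, fake_embed)   # computes and writes the file
    second = load_or_compute(checkpoint, fake_embed)  # reads the file, no recomputation

print(f"expensive step ran {len(calls)} time(s)")
```

One design consequence of this pattern is that a checkpoint identified only by its file name does not notice when its inputs change. If the chunking is modified, the corresponding embedding file has to be deleted manually to force a recomputation.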
embeddings = {}
for embedding_name, embedding_function in chroma_embedding_functions.items():
    for splitter_name, splitter_chunks in chunks.items():
        print(f"Generating embeddings for {embedding_name} with {splitter_name} splitter")
        embeddings[f"{embedding_name}_{splitter_name}"] = get_or_create_embeddings(
            embedding_function, splitter_chunks, f"{embedding_name}_{splitter_name}"
        )
Generating embeddings for bge-m3 with recursive_256_media splitter
Loaded 27820 embeddings from data/data_new/silver/embeddings/bge-m3_recursive_256_media_embeddings.json
Generating embeddings for bge-m3 with recursive_256_patent splitter
Loaded 6860 embeddings from data/data_new/silver/embeddings/bge-m3_recursive_256_patent_embeddings.json
Generating embeddings for bge-m3 with recursive_1024_media splitter
Loaded 6288 embeddings from data/data_new/silver/embeddings/bge-m3_recursive_1024_media_embeddings.json
Generating embeddings for bge-m3 with recursive_1024_patent splitter
Loaded 1825 embeddings from data/data_new/silver/embeddings/bge-m3_recursive_1024_patent_embeddings.json
Generating embeddings for bge-m3 with semantic_media splitter
Loaded 3457 embeddings from data/data_new/silver/embeddings/bge-m3_semantic_media_embeddings.json
Generating embeddings for bge-m3 with semantic_patent splitter
Loaded 1332 embeddings from data/data_new/silver/embeddings/bge-m3_semantic_patent_embeddings.json
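The number of embeddings corresponds to the number of chunks produced by each chunking strategy, not to the embedding dimensions. Thus a smaller chunk size (e.g. 256) yields more chunks than a larger one (1024), and semantic chunking yields even fewer. Note that for the patent dataset the loaded checkpoints differ slightly in size (6777 chunks vs. 6860 embeddings for recursive_256, and 1313 vs. 1332 for semantic), which suggests that the chunk and embedding checkpoints stem from different runs.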
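The relation between chunk size and chunk count can be approximated with a little arithmetic. The sketch below assumes idealized fixed-size windows that advance by `chunk_size - chunk_overlap` characters; the real recursive splitter cuts at separators, so its counts will deviate. The function name `estimate_chunk_count` and the 5000-character example are hypothetical.

```python
import math


def estimate_chunk_count(text_len: int, chunk_size: int, chunk_overlap: int) -> int:
    """Idealized number of sliding windows needed to cover text_len characters."""
    if text_len <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap
    # k windows cover (k - 1) * stride + chunk_size characters.
    return math.ceil((text_len - chunk_overlap) / stride)


# A hypothetical 5000-character article under the two recursive settings used above.
small = estimate_chunk_count(5000, chunk_size=256, chunk_overlap=64)
large = estimate_chunk_count(5000, chunk_size=1024, chunk_overlap=128)
print(small, large)
```

Under these assumptions the 256-character setting needs roughly four to five times as many chunks as the 1024-character setting, which is in line with the ratio of the media chunk counts loaded above (27820 vs. 6288).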
Storing the Embeddings in ChromaDB¶
As mentioned above, we will store the embeddings in ChromaDB for our semantic search retrieval. ChromaDB is a vector database for indexing and searching high-dimensional data. Among other things, it supports approximate nearest neighbor (ANN) search based on the Hierarchical Navigable Small World (HNSW) algorithm, which is known for its efficiency in searching high-dimensional spaces.
Much like a conventional SQL database, ChromaDB is accessed through a client. Here we use a persistent client that stores its data locally on disk (backed by SQLite) rather than connecting to a separate server. With this client we create a separate collection for each set of embeddings; a collection can be thought of as an index or a vector space of its own. These collections will then be used to search for the most relevant chunks for a user query.
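To illustrate what a query against a collection computes, here is a brute-force sketch of cosine-distance search in plain Python. It is not how ChromaDB works internally: HNSW approximates this ranking without comparing the query to every stored vector. The names `query_collection` and `toy_collection` and the 3D vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def query_collection(
    collection: Dict[str, Sequence[float]],
    query_vector: Sequence[float],
    n_results: int = 2,
) -> List[Tuple[str, float]]:
    """Exact nearest-neighbour search: score every stored vector and keep the closest."""
    scored = [
        (doc_id, cosine_distance(query_vector, vector))
        for doc_id, vector in collection.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])[:n_results]


# Toy "collection" mapping chunk ids to made-up 3D embeddings.
toy_collection = {
    "ev_charging": [0.9, 0.1, 0.0],
    "solar_panel": [0.1, 0.9, 0.1],
    "heat_pump": [0.0, 0.2, 0.9],
}
hits = query_collection(toy_collection, [0.2, 0.8, 0.0], n_results=2)
for doc_id, distance in hits:
    print(f"{doc_id}: {distance:.3f}")
```

The exact search scales linearly with the number of stored chunks, which is why an approximate index becomes attractive for collections of tens of thousands of chunks such as ours.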

gold_folder = data_folder / "gold"
if not gold_folder.exists():
    gold_folder.mkdir()
chromadb_folder = gold_folder / "chromadb"
if not chromadb_folder.exists():
    chromadb_folder.mkdir()
chroma_client = chromadb.PersistentClient(path=chromadb_folder.as_posix())
Again we can make use of preprocessed data as before to speed up the preparatory steps.
def get_or_create_collection(
    name: str,
    embedding_function: EmbeddingFunction,
    embeddings: List[DocumentEmbedding],
    batch_size: int = 128
) -> Collection:
    collection = chroma_client.get_or_create_collection(
        name=name,
        # configure to use cosine distance not default L2
        metadata={"hnsw:space": "cosine"},
        embedding_function=embedding_function
    )
    # Only fill the collection if it is still empty
    if collection.count() == 0:
        for i in tqdm(range(0, len(embeddings), batch_size)):
            batch = embeddings[i:i+batch_size]
            collection.add(
                documents=[embedding.document.page_content for embedding in batch],
                embeddings=[embedding.text_embedding for embedding in batch],
                ids=[str(embedding.document.metadata["id"]) for embedding in batch],
                metadatas=[embedding.document.metadata for embedding in batch]
            )
    return collection
collections = {}
for collection_name, current_embeddings in embeddings.items():
    collection = get_or_create_collection(
        collection_name,
        # the collection name starts with the embedding model key, e.g. "bge-m3"
        chroma_embedding_functions[collection_name.split("_")[0]],
        current_embeddings
    )
    collections[collection_name] = collection
    print(f"Collection {collection_name} has {collection.count()} documents")
Collection bge-m3_recursive_256_media has 27264 documents
Collection bge-m3_recursive_256_patent has 6852 documents
Collection bge-m3_recursive_1024_media has 6166 documents
Collection bge-m3_recursive_1024_patent has 1825 documents
Collection bge-m3_semantic_media has 3390 documents
Collection bge-m3_semantic_patent has 1332 documents
The above printout shows the bge-m3 embeddings for each of the three chunking strategies, separately for the media and patent datasets.
Once all embeddings are stored in ChromaDB, we can test the retrieval process by querying one of our collections. Try a few different queries and check whether the most similar chunks make sense.
selected_collection_media = collections["bge-m3_recursive_1024_media"]
results = selected_collection_media.query(
query_texts=["Climate Change"],
n_results=3,
)
for doc in results["documents"][0]:
    print(wrap_text(doc))
    print()
Climate change, water, and the energy transition challenge: Changes in the Earth’ s climate and weather patterns are already having a significant impact on many people’ s ability to access water. Concerningly, with the Paris Agreement in danger and global net zero targets still some way off, it’ s likely this problem will get worse before it gets better. This should give cause for concern when you consider that water is critical to the production of virtually all energy. Whether it be for use in raw material extraction for renewables components, steam generation for gas turbines, as a feedstock for green hydrogen, or as a cooling agent in nuclear power facilities and carbon capture and storage plants, H2O is integral. In the 2024 edition of our UK Energy Transition Outlook ( ETO), released earlier this year, DNV forecasts that by 2040, as the transition progresses and the UK’ s energy landscape evolves, electricity will account for about half of final demand. By 2050, low carbon sources are predicted to on track to meet Paris Agreement goals. Without stronger policy action, the global heat sector alone between 2023 and 2028 could consume more than one‑fifth of the remaining carbon budget for a pathway aligned with limiting global warming to 1.5°C. Global renewable heat consumption would have to rise 2.2 times as quickly and be combined with wide-scale demand-side measures and much larger energy and material efficiency improvements to align with the NZE Scenario. Get updates on the IEA’ s latest news, analysis, data and events delivered twice monthly. Thank you for subscribing. You can unsubscribe at any time by clicking the link at the bottom of any IEA newsletter. Climate Change Archives - Page 481 of 481: A new report has found that Google is breaking its October 2021 promise to not sell ads on YouTube videos containing climate misinformation. The... 
None of these red flags by themselves make a company, a product, or a purported solution a guaranteed failure or an outright scam. But... Mapping a large coastal glacier in Alaska revealed that its bulk sits below sea level and is undercut by channels, making it vulnerable to... “ If anyone eats anything from the ocean, you’ ve got to care about marine heat waves, ” argues Brenda Ekwurzel, a climate scientist at the Union... Japan will either figure it out or suffer the consequences of being completely unable to compete internationally and see their economy collapse to the... CleanTechnica is the # 1 cleantech-focused news & analysis website in the US & the world, focusing primarily on electric cars, solar energy, wind energy, & energy storage. News is published on CleanTechnica.com and reports are published on
selected_collection_patent = collections["bge-m3_recursive_1024_patent"]
results = selected_collection_patent.query(
query_texts=["Climate Change"],
n_results=3,
)
for doc in results["documents"][0]:
    print(wrap_text(doc))
    print()
heat pipeline, the hydrogen/oxygen fuel cell system is connected with a hot water tank for providing temperature difference for a temperature difference power generation device through a fourth circulating heat pipeline, and the temperature difference power generation device is electrically connected with a second storage battery through a lead, compared with the prior art, the utility model has the following beneficial effects: the solar energy is collected and converted into hydrogen energy with higher energy density, and the hydrogen energy is applied to the hydrogen/oxygen fuel cell system. Heat exchange station system and method for heating secondary net water by using geothermal energy: The invention provides a heat exchange station system and a method for heating secondary network water by utilizing geothermal energy, comprising an electric heating pump unit, a geothermal heating unit and a geothermal heat source unit, wherein the geothermal heat source unit comprises a geothermal water inlet pipeline and a geothermal water return pipeline, the outlet of the geothermal water inlet pipeline is divided into two paths, one path is connected with a heat source fluid inlet of the geothermal heating unit, and the other path is connected with a heat source fluid inlet of the electric heating pump unit; the outlet of the heated fluid of the geothermal heating unit is connected with the inlet of a secondary network water supply pipeline; a condensed fluid outlet of the electric heating pump unit is connected with an inlet of a secondary network water supply pipeline; a heat source fluid outlet of the on the analysis result, the fuel quantity of various substances involved in the combustion process and the recovery quantity of high-temperature flue gas are adjusted, so that new combustible ultralow-nitrogen and hydrogen-rich mixed gas with a brand new proportion is realized, and the purposes of increasing the combustion temperature, ultralow-nitrogen emission, minimum 
smoke discharge and energy conservation and environmental protection are achieved.
Analyzing the Embedding Space¶
To gain a better understanding of how the retrieval process works, we will analyze the embedding space. We will start by projecting the embeddings into a 2D space using UMAP. UMAP is a dimensionality reduction technique that is particularly well-suited for visualizing high-dimensional data in a lower-dimensional space. Its most notable advantages over other dimensionality reduction techniques such as t-SNE are increased speed and better preservation of the data's global structure. We will then use the UMAP embeddings to create a scatter plot of the chunks.
def get_vectors_from_collection(collection: Collection):
    stored_chunks = collection.get(include=["documents", "metadatas", "embeddings"])
    return np.array(stored_chunks["embeddings"])
def get_vectors_by_domain(collection: Collection, domain: str):
    stored_chunks = collection.get(include=["documents", "metadatas", "embeddings"])
    metadatas = stored_chunks["metadatas"]
    indices = []
    for metadata in metadatas:
        # Check if either 'topic' or 'domain' key exists in metadata
        if "topic" in metadata and metadata["topic"] == domain:
            indices.append(str(metadata["id"]))
        elif "domain" in metadata and metadata["domain"] == domain:
            indices.append(str(metadata["id"]))
    return collection.get(include=["embeddings"], ids=indices)["embeddings"]
def fit_umap(vectors: np.ndarray):
    return umap.UMAP().fit(vectors)
def project_embeddings(embeddings, umap_transform):
    return umap_transform.transform(embeddings)
The printed shapes in the following subsections show how the chunk embeddings, which have 1024 dimensions, are reduced to two dimensions for visualization purposes.
You can zoom into a plot by clicking and dragging a box around the area of interest. You can reset the plot by double-clicking on it.
Next we will color the embeddings by the domain of the article to see if there are any patterns or clusters in the embedding space based on the domain.
We can also visualize the retrieval process by plotting the query and the most similar chunks in the embedding space. This will give us a better understanding of how the retrieval process works and how the most similar chunks are found.
Note that the UMAP projection uses a metric approach which differs from the approximate nearest neighbor approach used for retrieval. Also keep in mind that the embeddings live in a high-dimensional space and we are only visualizing a 2D projection of them, so the distances between points in the plot might not reflect the distances used for retrieval. Try some different queries and see how the most similar chunks are found.
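To illustrate why distances in a 2D plot should be read with care, here is a minimal sketch on synthetic data. It is not the notebook's setup: the vectors are random, and a plain PCA projection (computed with NumPy) stands in for UMAP so that the example has no extra dependencies. It compares the nearest neighbors found by cosine distance in the full space with those found by Euclidean distance in the 2D projection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for chunk embeddings and a query (not the notebook's data).
vectors = rng.normal(size=(200, 64))
query = rng.normal(size=64)

def top_k_cosine(query, vectors, k):
    """Indices of the k vectors with the smallest cosine distance to the query."""
    sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    return np.argsort(1 - sims)[:k]

def pca_2d(vectors):
    """Return the mean and the first two principal axes (a linear stand-in for UMAP)."""
    mean = vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(vectors - mean, full_matrices=False)
    return mean, vt[:2].T

mean, axes = pca_2d(vectors)
projected = (vectors - mean) @ axes   # shape (200, 2)
query_2d = (query - mean) @ axes      # shape (2,)

k = 5
neighbors_full = top_k_cosine(query, vectors, k)
neighbors_2d = np.argsort(np.linalg.norm(projected - query_2d, axis=1))[:k]

# Count how many of the true nearest neighbors are also nearest in the plot.
overlap = len(set(neighbors_full) & set(neighbors_2d))
print(f"{overlap} of {k} nearest neighbors in the full space are also nearest in 2D")
```

A low overlap would mean that points close to the query in the plot are not necessarily the chunks that retrieval returns, which is why the retrieved neighbors are highlighted explicitly in the plots of this section.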
def plot_retrieval_results(
    query: str,
    selected_collection: Collection,
    n_results: int = 5
):
    vectors = get_vectors_from_collection(selected_collection)
    umap_transform = fit_umap(vectors)
    vectors_projections = project_embeddings(vectors, umap_transform)
    query_embedding = selected_collection._embedding_function([query])[0]
    query_embedding = np.array(query_embedding).reshape(1, -1)
    query_projection = project_embeddings(query_embedding, umap_transform)
    nearest_neighbors = selected_collection.query(
        query_texts=[query],
        n_results=n_results,
    )
    neighbor_vectors = selected_collection.get(include=["embeddings"], ids=nearest_neighbors["ids"][0])["embeddings"]
    neighbor_projections = project_embeddings(neighbor_vectors, umap_transform)
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=vectors_projections[:, 0], y=vectors_projections[:, 1], mode='markers', marker=dict(size=5), name="other vectors"))
    fig.add_trace(go.Scatter(x=neighbor_projections[:, 0], y=neighbor_projections[:, 1], mode='markers', marker=dict(size=5, color='orange'), name="nearest neighbors"))
    fig.add_trace(go.Scatter(x=query_projection[:, 0], y=query_projection[:, 1], mode='markers', marker=dict(size=10, color='red', symbol='x'), name="query"))
    fig.show(renderer="colab")
Lastly we will analyze the distribution of the cosine distances between the query and the different chunks. This will give us a better understanding of the cosine distance and show that the distances in the high-dimensional space are not the same as in the 2D projection. Do not confuse the cosine distance with the cosine similarity. The cosine similarity is the cosine of the angle between two vectors and the cosine distance is 1 minus the cosine similarity so that smaller numbers mean the vectors are more similar.
def cosine_distance(vector1, vector2):
    dot_product = np.dot(vector1, vector2.T)
    norm_product = np.linalg.norm(vector1) * np.linalg.norm(vector2)
    similarity = dot_product / norm_product
    return 1 - similarity
def plot_cosine_distances(
    query: str,
    selected_collection: Collection
):
    vectors = get_vectors_from_collection(selected_collection)
    umap_transform = fit_umap(vectors)
    vectors_projections = project_embeddings(vectors, umap_transform)
    query_embedding = selected_collection._embedding_function([query])[0]
    query_embedding = np.array(query_embedding).reshape(1, -1)
    query_projection = project_embeddings(query_embedding, umap_transform)
    # These values are cosine distances (1 - cosine similarity), not similarities
    distances = np.array([cosine_distance(query_embedding, vector) for vector in vectors])
    fig = go.Figure()
    fig.add_trace(go.Scatter(
        x=vectors_projections[:, 0],
        y=vectors_projections[:, 1],
        mode='markers',
        marker=dict(
            size=5,
            color=distances.flatten(),
            colorscale='RdBu',
            colorbar=dict(title='Cosine Distance')
        ),
        text=['Cosine Distance: {:.4f}'.format(
            dist) for dist in distances.flatten()],
        name='Other Vectors'
    ))
    fig.add_trace(go.Scatter(x=[query_projection[0][0]], y=[
        query_projection[0][1]], mode='markers', marker=dict(size=10, color='black', symbol='x'), text=['Query Vector'], name='Query Vector'))
    fig.show(renderer="colab")
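To make the difference between cosine similarity and cosine distance concrete, here is a small worked example with hand-picked 2D vectors. The vectors and the helper `cosine_similarity` are illustrative and not part of the notebook's pipeline.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
examples = {
    "same direction": np.array([3.0, 0.0]),   # longer, but pointing the same way
    "orthogonal": np.array([0.0, 2.0]),
    "opposite": np.array([-1.0, 0.0]),
}

for name, b in examples.items():
    similarity = cosine_similarity(a, b)
    distance = 1 - similarity
    print(f"{name}: similarity={similarity:.1f}, distance={distance:.1f}")

# same direction: similarity=1.0, distance=0.0
# orthogonal: similarity=0.0, distance=1.0
# opposite: similarity=-1.0, distance=2.0
```

The example shows that cosine distance ignores vector length and ranges from 0 (same direction) to 2 (opposite direction).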
Analyze the embedding space for media dataset¶
vectors = get_vectors_from_collection(selected_collection_media)
print(f"Original shape: {vectors.shape}")
umap_transform = fit_umap(vectors)
vectors_projections = project_embeddings(vectors, umap_transform)
print(f"Projected shape: {vectors_projections.shape}")
Original shape: (6166, 1024)
Projected shape: (6166, 2)
fig = px.scatter(x=vectors_projections[:, 0], y=vectors_projections[:, 1])
fig.show(renderer="colab")
fig = go.Figure()
for domain in sample_df_media["domain"].unique():
    domain_vectors = get_vectors_by_domain(selected_collection_media, domain)
    domain_projections = project_embeddings(domain_vectors, umap_transform)
    fig.add_trace(go.Scatter(x=domain_projections[:, 0], y=domain_projections[:, 1], mode='markers', marker=dict(size=4), name=domain))
fig.show(renderer="colab")
plot_retrieval_results(
"Climate Change",
selected_collection_media,
)
plot_cosine_distances(
"Climate Change",
selected_collection_media,
)
Analyze the embedding space for patent dataset¶
vectors = get_vectors_from_collection(selected_collection_patent)
print(f"Original shape: {vectors.shape}")
umap_transform = fit_umap(vectors)
vectors_projections = project_embeddings(vectors, umap_transform)
print(f"Projected shape: {vectors_projections.shape}")
Original shape: (1825, 1024)
Projected shape: (1825, 2)
fig = px.scatter(x=vectors_projections[:, 0], y=vectors_projections[:, 1])
fig.show(renderer="colab")
fig = go.Figure()
for domain in sample_df_patent["topic"].unique():
    domain_vectors = get_vectors_by_domain(selected_collection_patent, domain)
    domain_projections = project_embeddings(domain_vectors, umap_transform)
    # Convert the domain to a string before passing it as the 'name'
    fig.add_trace(go.Scatter(x=domain_projections[:, 0], y=domain_projections[:, 1], mode='markers', marker=dict(size=4), name=str(domain)))
fig.show(renderer="colab")
plot_retrieval_results(
"Climate Change",
selected_collection_patent,
)
plot_cosine_distances(
"Climate Change",
selected_collection_patent,
)
plot_retrieval_results(
"Renewable Energy",
selected_collection_media,
)
plot_cosine_distances(
"Renewable Energy",
selected_collection_media,
)
plot_retrieval_results(
"Renewable Energy",
selected_collection_patent,
)
plot_cosine_distances(
"Renewable Energy",
selected_collection_patent,
)
In the plots above, the patent dataset shows more spread in both cosine distances and embedding space than the media dataset. This suggests that the patent data contains more diverse or distinct concepts, with the model capturing a wider range of semantic differences. The media dataset, on the other hand, likely has more homogeneous embeddings, indicating greater similarity between its data points. A higher cosine distance means that the vectors (representing the query and the chunks) are farther apart in the high-dimensional space: since cosine distance is calculated as 1 - cosine similarity, a larger cosine distance corresponds to a smaller cosine similarity, indicating that the vectors are less similar or closer to orthogonal.
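One way to back up this visual impression with numbers would be to summarize the distribution of cosine distances per collection. The sketch below does this on synthetic vectors only: `homogeneous` and `diverse` are made-up stand-ins for the two corpora, and `cosine_distances` and `summarize` are hypothetical helpers, not functions from the notebook.

```python
import numpy as np

def cosine_distances(query, vectors):
    """Cosine distance between one query vector and every row of `vectors`."""
    sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    return 1 - sims

def summarize(distances):
    """Summary statistics describing how spread out the distances are."""
    return {
        "min": float(distances.min()),
        "mean": float(distances.mean()),
        "std": float(distances.std()),
        "max": float(distances.max()),
    }

rng = np.random.default_rng(42)
query = rng.normal(size=32)

# Synthetic stand-ins, NOT the notebook's collections:
# a homogeneous corpus (small variations of one vector) and a diverse one.
base = rng.normal(size=32)
homogeneous = base + 0.05 * rng.normal(size=(500, 32))
diverse = rng.normal(size=(500, 32))

for name, vectors in [("homogeneous", homogeneous), ("diverse", diverse)]:
    stats = summarize(cosine_distances(query, vectors))
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

With the notebook's objects, `vectors` would come from `get_vectors_from_collection` and `query` from the collection's embedding function. A larger standard deviation for one collection would then support the claim that its chunks are more spread out relative to the query.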
Putting it all Together¶
Now that we have generated the embeddings and stored them in ChromaDB we can put it all together and create the RAG pipeline. The RAG pipeline consists of the following steps:
- Indexing: The first step is the preparation we have already done. We have chunked the articles, generated embeddings for the chunks and stored them in ChromaDB, our vector store/index.
- Retrieval: The next step in the RAG pipeline is to retrieve the chunks most relevant to the user query. This is done by querying the ChromaDB index with the user query and retrieving the most similar chunks.
- Generation: The final step is to generate a response to the user query. This is done by feeding the retrieved chunks and the user query to the LLM and generating a response.
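The three steps above can be sketched in a few lines of plain Python. This is a toy illustration under loud assumptions: the corpus is made up, `embed` is a hypothetical bag-of-words embedding instead of a real embedding model, a NumPy matrix replaces ChromaDB, and `generate` only builds the prompt instead of calling an LLM.

```python
import numpy as np

# Toy corpus standing in for the chunked articles (illustrative text only).
chunks = [
    "Solar panels convert sunlight into electricity.",
    "Wind turbines generate power from moving air.",
    "Green hydrogen is produced by electrolysis using renewable electricity.",
]

def tokenize(text):
    return [word.strip(".,?!").lower() for word in text.split()]

vocabulary = sorted({word for chunk in chunks for word in tokenize(chunk)})

def embed(text):
    """Hypothetical bag-of-words embedding, normalized to unit length."""
    words = tokenize(text)
    vec = np.array([words.count(term) for term in vocabulary], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# 1. Indexing: embed every chunk once and keep the vectors.
index = np.stack([embed(chunk) for chunk in chunks])

# 2. Retrieval: rank chunks by cosine similarity to the query.
def retrieve(query, k=2):
    scores = index @ embed(query)
    best = np.argsort(-scores)[:k]
    return [chunks[i] for i in best]

# 3. Generation: stub for the LLM call, it only assembles the prompt.
def generate(query, context):
    context_text = "\n".join(context)
    return f"Question: {query}\nContext: {context_text}\nAnswer:"

question = "How is green hydrogen produced?"
print(generate(question, retrieve(question)))
```

In the actual pipeline each stub is replaced by a real component: the embedding model, the ChromaDB collection and the LLM.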
How does Langchain work?¶
In this notebook we will be using Langchain to build up our pipeline. You do not need a library like Langchain or LlamaIndex to build a RAG pipeline, but it can make the process easier.
The idea of Langchain and its LCEL (Langchain Expression Language) is very simple. Within the pipeline there are lots of steps that take an input and produce an output. These steps can be chained together to form a pipeline. The LCEL is a simple language that allows you to define these steps and how they are connected. For more technical details on how Langchain works check out the Langchain Documentation.
In simple terms, Langchain provides an abstraction of a step with an invoke method that takes an input, typically a dictionary of parameters, and returns an output, often also a dictionary. This allows you to chain different steps together, define how they are connected, and split off chains of steps into separate pipelines.
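The chaining idea can be illustrated with a toy class. The `Step` class below is a hypothetical stand-in written for this example; it is not Langchain's implementation, and the three steps are dummy functions. It only mimics the two ingredients described above: an `invoke` method and the `|` operator for connecting steps.

```python
class Step:
    """Toy stand-in for a chainable pipeline step (not Langchain's own class)."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # step_a | step_b builds a new step that runs step_a, then step_b on its output
        return Step(lambda value: other.invoke(self.invoke(value)))

# Dummy steps: a fake retriever, a prompt builder and a fake LLM.
retrieve = Step(lambda q: {"question": q, "context": ["chunk about " + q]})
build_prompt = Step(lambda d: f"Question: {d['question']}\nContext: {d['context'][0]}")
fake_llm = Step(lambda prompt: prompt.upper())

chain = retrieve | build_prompt | fake_llm
print(chain.invoke("solar power"))
# QUESTION: SOLAR POWER
# CONTEXT: CHUNK ABOUT SOLAR POWER
```

Because the result of `|` is itself a step, chains can be nested and reused inside larger pipelines.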
Below you can see an overview of our RAG pipeline:

And now let's look at the implementation of the RAG pipeline.
def create_qa_chain(retriever: BaseRetriever):
    template = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. \
If you don't know the answer, just say that you don't know. Keep the answer concise.
Question: {question}
Context: {context}
Answer:
"""
    rag_prompt = ChatPromptTemplate.from_template(template)
    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)
    rag_chain = RunnableParallel(
        {
            "context": retriever,
            "question": RunnablePassthrough()
        }
    ).assign(answer=(
        RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
        | rag_prompt
        | llm
        | StrOutputParser()
    ))
    return rag_chain
For Langchain to work with our ChromaDB collections, we need to wrap the collections in objects that Langchain can work with, so-called stores and retrievers.
def collection_to_store(collection_name: str, lc_embedding_model: EmbeddingFunction):
    return Chroma(
        client=chroma_client,
        collection_name=collection_name,
        embedding_function=lc_embedding_model,
    )
def store_to_retriever(store: VectorStore, k: int = 3):
    retriever = store.as_retriever(
        search_type="similarity", search_kwargs={'k': k}
    )
    return retriever
To check¶
Now that we have our retriever we can create our RAG pipeline. Try some different queries and see how the pipeline responds.
selected_store_media = collection_to_store("bge-m3_recursive_1024_media", embedding_models["bge-m3"])
selected_retriever_media = store_to_retriever(selected_store_media)
selected_retriever_media.invoke("Climate Change")
[Document(metadata={'domain': 'energyvoice', 'id': 3186, 'title': 'Climate change, water, and the energy transition challenge', 'url': 'energyvoice.com/renewables-energy-transition/hydrogen/552964/climate-change-water-and-the-energy-transition-managing-the-balance'}, page_content='Climate change, water, and the energy transition challenge: Changes in the Earth’ s climate and weather patterns are already having a significant impact on many people’ s ability to access water. Concerningly, with the Paris Agreement in danger and global net zero targets still some way off, it’ s likely this problem will get worse before it gets better. This should give cause for concern when you consider that water is critical to the production of virtually all energy. Whether it be for use in raw material extraction for renewables components, steam generation for gas turbines, as a feedstock for green hydrogen, or as a cooling agent in nuclear power facilities and carbon capture and storage plants, H2O is integral. In the 2024 edition of our UK Energy Transition Outlook ( ETO), released earlier this year, DNV forecasts that by 2040, as the transition progresses and the UK’ s energy landscape evolves, electricity will account for about half of final demand. By 2050, low carbon sources are predicted to'),
Document(metadata={'domain': 'iea', 'id': 3468, 'title': 'Executive summary – Renewables 2023 – Analysis', 'url': 'iea.org/reports/renewables-2023/executive-summary'}, page_content='on track to meet Paris Agreement goals. Without stronger policy action, the global heat sector alone between 2023 and 2028 could consume more than one‑fifth of the remaining carbon budget for a pathway aligned with limiting global warming to 1.5°C. Global renewable heat consumption would have to rise 2.2 times as quickly and be combined with wide-scale demand-side measures and much larger energy and material efficiency improvements to align with the NZE Scenario. Get updates on the IEA’ s latest news, analysis, data and events delivered twice monthly. Thank you for subscribing. You can unsubscribe at any time by clicking the link at the bottom of any IEA newsletter.'),
Document(metadata={'domain': 'cleantechnica', 'id': 1194, 'title': 'Climate Change Archives - Page 481 of 481', 'url': 'cleantechnica.com/category/climate-change/page/481'}, page_content='Climate Change Archives - Page 481 of 481: A new report has found that Google is breaking its October 2021 promise to not sell ads on YouTube videos containing climate misinformation. The... None of these red flags by themselves make a company, a product, or a purported solution a guaranteed failure or an outright scam. But... Mapping a large coastal glacier in Alaska revealed that its bulk sits below sea level and is undercut by channels, making it vulnerable to... “ If anyone eats anything from the ocean, you’ ve got to care about marine heat waves, ” argues Brenda Ekwurzel, a climate scientist at the Union... Japan will either figure it out or suffer the consequences of being completely unable to compete internationally and see their economy collapse to the... CleanTechnica is the # 1 cleantech-focused news & analysis website in the US & the world, focusing primarily on electric cars, solar energy, wind energy, & energy storage. News is published on CleanTechnica.com and reports are published on')]
selected_chain_media = create_qa_chain(selected_retriever_media)
selected_chain_media.invoke("Where are the biggest increases in wildfire smoke exposure in recent years?")
{'context': [Document(metadata={'domain': 'cleantechnica', 'id': 1300, 'title': 'Some Gas Utilities Are Leading The Way Into The Future, Others Are Actively In Reverse', 'url': 'cleantechnica.com/2023/05/11/some-gas-utilities-are-leading-the-way-into-the-future-others-are-actively-in-reverse'}, page_content='halved. Australians watching the news about Canada’ s current spate of wildfires will remember our own horrifying bushfires a couple of years ago. The fear is... Copyright © 2023 CleanTechnica. The content produced by this site is for entertainment purposes only. Opinions and comments published on this site may not be sanctioned by and do not necessarily represent the views of CleanTechnica, its owners, sponsors, affiliates, or subsidiaries.'),
Document(metadata={'domain': 'cleantechnica', 'id': 938, 'title': 'World’ s Largest Floating Solar Array, Manchin Movement On Climate — Nexus News Roundup', 'url': 'cleantechnica.com/2022/01/06/worlds-largest-floating-solar-array-manchin-movement-on-climate-nexus-news-roundup'}, page_content='grapple with wildfire disaster in their backyard ( Axios), photos: wildfires engulf 1,000 homes in suburban Denver ( NPR). A syndicated newswire covering climate, energy, policy, art and culture. Advertise with CleanTechnica to get your company in front of millions of monthly readers. The SAIC–GM–Wuling ( SGMW) joint venture has been setting the trends in China when it comes to cool, affordable, mini EVs. After the smash hit... The blockbuster Wuling Hongguang Mini EV has been a huge success in China. It has been so successful that it has created a totally... Japan will either figure it out or suffer the consequences of being completely unable to compete internationally and see their economy collapse to the... Plugin vehicles are all the rage in the Chinese auto market. Plugins scored over half a million sales last month, up 23% year over... Copyright © 2023 CleanTechnica. The content produced by this site is for entertainment purposes only. Opinions and comments published on this site may not be sanctioned'),
Document(metadata={'domain': 'cleantechnica', 'id': 937, 'title': 'World’ s Largest Floating Solar Array, Manchin Movement On Climate — Nexus News Roundup', 'url': 'cleantechnica.com/2022/01/06/worlds-largest-floating-solar-array-manchin-movement-on-climate-nexus-news-roundup'}, page_content='day, making it the most destructive wildfire in the state’ s history. “ With CLIMATE CHANGE, there is no FIRE SEASON anymore, ” tweeted Mike Nelson, chief meteorologist for Denver7, the city’ s ABC affiliate. Climate change, primarily caused by the extraction and combustion of fossil fuels, supercharges fires like the Marshall Fire through increased temperatures and exacerbated drought. The blaze, which ignited Thursday, was effectively extinguished by snowfall by the next day. More than 30,000 people were forced to evacuate but just two people were missing as of Monday. How to give and get help ( Boulder Daily Camera), how climate change primed Colorado for a rare December wildfire ( CNBC), fires outside of Denver were the most destructive in Colorado history ( NPR), climate change-fueled blaze destroys 1,000 homes in Colorado in rare winter wildfire ( Democracy Now), climate scientists grapple with wildfire disaster in their backyard ( Axios), photos: wildfires engulf 1,000 homes in suburban Denver ( NPR).')],
'question': 'Where are the biggest increases in wildfire smoke exposure in recent years?',
'answer': "I don't know."}
selected_store_patent = collection_to_store("bge-m3_recursive_1024_patent", embedding_models["bge-m3"])
selected_retriever_patent = store_to_retriever(selected_store_patent)
selected_retriever_patent.invoke("Climate Change")
[Document(metadata={'id': 981, 'title': 'Hydrogen energy recycling system in extremely cold region', 'topic': 4}, page_content='heat pipeline, the hydrogen/oxygen fuel cell system is connected with a hot water tank for providing temperature difference for a temperature difference power generation device through a fourth circulating heat pipeline, and the temperature difference power generation device is electrically connected with a second storage battery through a lead, compared with the prior art, the utility model has the following beneficial effects: the solar energy is collected and converted into hydrogen energy with higher energy density, and the hydrogen energy is applied to the hydrogen/oxygen fuel cell system.'),
Document(metadata={'id': 341, 'title': 'Heat exchange station system and method for heating secondary net water by using geothermal energy', 'topic': 1}, page_content='Heat exchange station system and method for heating secondary net water by using geothermal energy: The invention provides a heat exchange station system and a method for heating secondary network water by utilizing geothermal energy, comprising an electric heating pump unit, a geothermal heating unit and a geothermal heat source unit, wherein the geothermal heat source unit comprises a geothermal water inlet pipeline and a geothermal water return pipeline, the outlet of the geothermal water inlet pipeline is divided into two paths, one path is connected with a heat source fluid inlet of the geothermal heating unit, and the other path is connected with a heat source fluid inlet of the electric heating pump unit; the outlet of the heated fluid of the geothermal heating unit is connected with the inlet of a secondary network water supply pipeline; a condensed fluid outlet of the electric heating pump unit is connected with an inlet of a secondary network water supply pipeline; a heat source fluid outlet of the'),
Document(metadata={'id': 921, 'title': 'Control method of hydrogen energy combustion-supporting ultralow nitrogen combustor', 'topic': 4}, page_content='on the analysis result, the fuel quantity of various substances involved in the combustion process and the recovery quantity of high-temperature flue gas are adjusted, so that new combustible ultralow-nitrogen and hydrogen-rich mixed gas with a brand new proportion is realized, and the purposes of increasing the combustion temperature, ultralow-nitrogen emission, minimum smoke discharge and energy conservation and environmental protection are achieved.')]
selected_chain_patent = create_qa_chain(selected_retriever_patent)
selected_chain_patent.invoke("How does the system reduce the cost of cold storage?")
{'context': [Document(metadata={'id': 1817, 'title': 'Distributed photovoltaic energy storage refrigeration house system', 'topic': 7}, page_content='the cold accumulation compressor unit; the refrigeration house refrigeration system comprises a refrigeration house body, a refrigeration house compressor, a refrigeration house evaporator and a refrigeration house condenser, wherein the refrigeration house body is in contact with the cold accumulation equipment, the refrigeration house evaporator and the refrigeration house compressor are located in the refrigeration house body, and the refrigeration house condenser is located in the cold accumulation equipment. The solar energy is converted into the electric energy, the electric energy is converted into the cold source to store ice, and the cold storage equipment supplies cold to the cold storage, so that the cost of the cold storage is reduced, and the energy is saved.'),
Document(metadata={'id': 1025, 'title': 'Energy-saving on-site hydrogen production hydrogenation station system', 'topic': 4}, page_content='Energy-saving on-site hydrogen production hydrogenation station system: The invention discloses an energy-saving on-site hydrogen production hydrogenation station system, which belongs to the field of hydrogen energy utilization and is characterized in that the energy conservation and consumption reduction of the hydrogen production and hydrogenation process are realized by combining ammonia decomposition on-site hydrogen production with a hydrogenation station and optimizing the system configuration, so that the storage and transportation cost of hydrogen is reduced. The high-pressure hydrogen cooling device is combined with the filling device, and the hydrogen is cooled after the filling device throttles, so that the temperature requirement of a cold source is reduced, and the COP of the refrigeration cycle is improved to save electricity consumption; the cooling capacity required by the cooling device comes from ammonia refrigeration cycle, the flexible adjustment of refrigeration work quality can be'),
Document(metadata={'id': 141, 'title': 'Complete shipment packaging structure of solar energy component', 'topic': 0}, page_content='in structure, does not need to package each accessory independently, reduces consumable of packaging materials and reduces cost.')],
'question': 'How does the system reduce the cost of cold storage?',
'answer': 'The system reduces the cost of cold storage by using solar energy to generate electric energy, which is then converted into a cold source for ice storage. This approach minimizes energy consumption and costs associated with cold storage.'}
selected_retriever_media.invoke("Renewable Energy")
[Document(metadata={'domain': 'energy-xprt', 'id': 2954, 'title': 'Renewable Energy Technology ( Renewable Energy) Training and...', 'url': 'energy-xprt.com/renewable-energy/renewable-energy-technology/training'}, page_content='Renewable Energy Technology ( Renewable Energy) Training and...: This course examines the role of a variety of technologies and strategies in achieving a sustainable urban environment. It covers opportunities for energy conservation, the shift toward decentralized power generation, several renewable energy technologies adapted for the urban environment, energy efficient buildings, alternative modes of transportation, urban planning, and government... By School of the Environment-University of Toronto based in Toronto, ONTARIO ( CANADA). This course investigates the principle types of renewable energy, as well as historical and technological challenges, and their place in the current global market. The place of renewable energy in society as a whole is examined through individual, political, corporate, and industry... By School of the Environment-University of Toronto based in Toronto, ONTARIO ( CANADA). Energy & Human development, Solar Energy, Wind Power 01 – 20 September, 2014 – Paderborn, Germany. The'),
Document(metadata={'domain': 'ecofriend', 'id': 1559, 'title': 'Solar Roofs: Everything You Might Want to Know About Them', 'url': 'ecofriend.com/how-do-solar-roofs-work.html'}, page_content='buildings to easily generate their renewable energy. It does so while cutting down carbon emissions significantly too! The integration of advanced construction techniques with renewables is breaking barriers encouraging more households and businesses to get aboard the sustainability train! This harmonious blend of construction and renewable energy technology enables homes and buildings to generate their own power, reducing reliance on the grid and decreasing carbon footprints. We encourage you to look further into whether this is the right option for you. You’ ll likely find that it is! EcoFriend.com – A Dr Prem Guides and Magazines Site. With 50+ web magazines and 5 million monthly readership, we invite you for Promotion, Review, Ranking and Marketing of your Content, Products and Services. Also connect with us for sale and purchase of websites. Contact Us Now.'),
Document(metadata={'domain': 'solarquarter', 'id': 5484, 'title': "Empowering India's Sustainable Future: An Overview of Juniper Green Energy's Mission and Vision in Solar Energy", 'url': 'solarquarter.com/2023/10/20/empowering-indias-sustainable-future-an-overview-of-juniper-green-energys-mission-and-vision-in-solar-energy-naresh-mansukhani-ceo-juniper-green-energy'}, page_content='fossil fuels by utilizing renewable energy sources such as solar power. This not only helps the environment but also improves air quality, improving human and animal health and well-being. Sustainability is closely tied to green energy since both aim to preserve the environment and mitigate the consequences of climate change. At Juniper Green Energy, we are deeply committed to building a sustainable and equitable future for all. Our focus is on contributing to a low-carbon economy, making significant strides in reducing greenhouse gas emissions, and conserving valuable resources through our renewable energy projects and initiatives. Our dedication to green energy has yielded impressive results, leading to the avoidance of 1.2 million metric tons of CO2 emissions per year. We recognize the urgency of mitigating climate change and are proud to play our part in this crucial endeavor. Moreover, our sustainable practices have allowed us to save over 120 million litres of water annually. Water is a precious')]
selected_retriever_patent.invoke("Renewable Energy")
[Document(metadata={'id': 955, 'title': 'Electricity storage hydrogen production method and application thereof', 'topic': 4}, page_content='with a renewable energy source generating device to establish a local energy source network; the invention solves the safety problems of hydrogen production, storage, transportation and use, provides a convenient and low-cost energy storage mode for renewable energy sources such as solar energy, wind energy, ocean energy and the like, and has wide application prospect in the fields of mobile equipment of vehicles, fixed equipment such as power stations and the like, chemical industry metallurgy and the like, and the fields related to hydrogen energy.'),
Document(metadata={'id': 536, 'title': 'Renewable energy driven zero-carbon efficient distributed energy supply system and operation method', 'topic': 1}, page_content='Renewable energy driven zero-carbon efficient distributed energy supply system and operation method: The invention discloses a zero-carbon efficient distributed energy supply system driven by renewable energy and an operation method thereof, wherein the system generates electricity by using renewable energy such as wind, light and the like to provide electric energy for users; hydrogen is produced by electrolyzing water and then is conveyed to hydrogen storage equipment to meet the demand of hydrogen load; the electric drive compression heat pump is used for cooling and heating, and the thermochemical energy storage device is used for storing and recycling solar heat; for the absorption heat pump system, high-temperature steam generated by the thermochemical heat storage system and high-temperature hot water generated by a fuel cell in the hydrogen energy storage system are used as a combined driving heat source, so that the energy utilization efficiency of the system is improved. The invention fully utilizes'),
Document(metadata={'id': 1657, 'title': 'Multi-energy cooperative energy station, method and storage medium', 'topic': 8}, page_content='and receiving energy utilization reservations of one or more users; and the cooperative controller is used for receiving the environmental data and the energy consumption data acquired by the acquisition component, calling a scheduling algorithm to control all equipment in the energy station to start and stop, wherein the computer predicts the photovoltaic power generation amount and the solar heat collection amount based on the environmental data, and carries out biomass production, photovoltaic power storage and photo-thermal heat storage according to the energy consumption data, the production plan and the reservation condition. Thus, embodiments of the present application enable dynamic synergy between multiple renewable energy sources and energy usage requirements by utilizing renewable energy sources.')]
selected_chain_media.invoke("Which organizations or countries are most active in cleantech technology?")
{'context': [Document(metadata={'domain': 'energyvoice', 'id': 3298, 'title': 'Who are the top innovators for the energy transition? - News for the Energy Sector', 'url': 'energyvoice.com/events/392726/who-are-the-top-innovators-for-the-energy-transition-articleisfree'}, page_content='environments. To explore this fast-changing landscape, Reuters Events we have drawn on their expertise as the world’ s leading provider of cleantech events to select 100 of the companies that we feel are leading innovation in the energy transition, split across key categories. Within the 10 companies named in each category, we have selected three for special mention. It is important to note that companies are not listed in any particular order: in the race to net-zero emissions, every ounce of innovation is worthy of praise. Download a complimentary copy of the report today: Click here Any top 100 list will attract scrutiny for the names that were included and those that were left out. This one, based on privileged insights gathered by the Reuters Events team while producing dozens of energy transition-focused events and reports in 2021, will doubtless be no different.'),
Document(metadata={'domain': 'iea', 'id': 3394, 'title': 'The State of Clean Technology Manufacturing – November 2023 Update – Analysis', 'url': 'iea.org/reports/the-state-of-clean-technology-manufacturing-november-2023-update'}, page_content='The State of Clean Technology Manufacturing – November 2023 Update – Analysis: Create a free IEA account to download our reports or subcribe to a paid service. Clean technology manufacturing is at the core of efforts to meet the world’ s climate, energy security and economic development goals. Deploying clean energy technologies at the pace required to put the world on a trajectory consistent with net zero emissions by mid-century will demand rapid expansion in manufacturing capacity, underpinned by secure, resilient and sustainable supply chains for their components and materials. This Energy Technology Perspectives Special Briefing provides a targeted update on recent progress in clean energy technology manufacturing in key regions. Covering five technologies – solar PV, wind, batteries, electrolysers and heat pumps – that will be critical to the energy transition, the analysis is focused on the areas of supply chains that are showing the greatest dynamism in response to recent policy and industrial'),
Document(metadata={'domain': 'iea', 'id': 3397, 'title': 'High-Level Dialogue: Diversifying Clean Technology Manufacturing', 'url': 'iea.org/events/high-level-dialogue-diversifying-clean-technology-manufacturing'}, page_content='High-Level Dialogue: Diversifying Clean Technology Manufacturing: Create a free IEA account to download our reports or subcribe to a paid service. On 6 November 2023 the International Energy Agency ( IEA) will host a high-level dialogue on the topic of Diversifying Clean Technology Manufacturing. Many elements of clean technology supply chains are highly geographically concentrated. This is particularly true at the manufacturing step, where four countries and the European Union account for 80-90% of global production capacity for key clean energy technologies like solar PV, wind, batteries, heat pumps and electrolysers. If governments are to make progress towards establishing secure, resilient and sustainable supply chains for these critical components of clean energy transitions, they will need carefully designed industrial strategies that unlock investment, while at the same time maintaining competitive markets and international trade. Strategic partnerships can help to bridge gaps in domestic supply chains')],
'question': 'Which organizations or countries are most active in cleantech technology?',
'answer': 'The most active organizations and countries in cleantech technology include four countries and the European Union, which account for 80-90% of global production capacity for key clean energy technologies such as solar PV, wind, batteries, heat pumps, and electrolysers.'}
selected_chain_patent.invoke("Which organizations or countries are most active in cleantech technology?")
{'context': [Document(metadata={'id': 197, 'title': 'Water quality monitoring device', 'topic': 0}, page_content='Water quality monitoring device: The utility model belongs to the technical field of marine product cultivation underwater, in particular to a water quality monitoring device, aiming at three problems of filter screen cleaning, battery replacement and counterweight adjustment, which are proposed by the background technology, the utility model provides a scheme which comprises a cleaning mechanism, a power supply mechanism and an adjusting mechanism, wherein the cleaning mechanism comprises a balance plate, a filter box fixedly connected with the center position of the bottom of the balance plate, a cleaning motor arranged at the center position of the top of the balance plate, a mounting frame for connecting an output shaft of the cleaning motor through the balance plate and a cleaning roller arranged at the bottom of the mounting frame, and the power supply mechanism comprises a protective cover fixedly connected with the top of the balance plate through bolts, a storage battery arranged in the protective'),
Document(metadata={'id': 1118, 'title': 'Water quality environment monitoring device', 'topic': 5}, page_content='Water quality environment monitoring device: The utility model discloses a water quality environment monitoring device, including water quality monitoring instrument, still include the flotation tank body, flotation tank body upper portion is equipped with the showy gasbag on encircleing its outer wall, the fixed surface is provided with the solar photovoltaic board on the showy gasbag, the internal battery that has of flotation tank body bottom, battery electric connection main control board, be equipped with photovoltaic control device on the main control board, the internal upper portion of flotation tank is equipped with the extrusion gasbag, the extrusion gasbag embeds there is water quality monitoring instrument, water quality monitoring instrument passes through line connection wireless transceiver, wireless transceiver sets up in the top of the flotation tank body. The utility model overcomes prior art shortcoming, simple structure is reasonable, stably floats in the surface of water, through the'),
Document(metadata={'id': 154, 'title': 'Eutrophic water body ecological management device', 'topic': 0}, page_content='Eutrophic water body ecological management device: The invention discloses an ecological treatment device for eutrophic water, which belongs to the technical field of ecological treatment devices for water and comprises a solar aerator body, wherein a supporting plate is rotatably connected to the solar aerator body, a solar panel is mounted on the supporting plate, a power structure is arranged on the supporting plate, an auxiliary structure is arranged on the power structure, a cleaning structure is arranged on the supporting plate, the power structure drives the cleaning structure to reciprocate, and an adjusting structure is arranged on the solar aerator body; through setting up power structure and clearance structure, the mount frame drives clearance board and clearance brush and carries out synchronous motion when removing to make the clearance brush can clear up the solar panel surface, avoid the dust adhesion on the solar panel surface, guarantee the clean and tidy nature on solar panel surface,')],
'question': 'Which organizations or countries are most active in cleantech technology?',
'answer': "I don't know."}
We only consider the bge-m3 embeddings here:
chains = {}
for collection_name, collection in collections.items():
    # The collection name starts with the embedding model name, e.g. "bge-m3_recursive_256_media"
    store = collection_to_store(collection_name, embedding_models[collection_name.split("_")[0]])
    retriever = store_to_retriever(store)
    chain = create_qa_chain(retriever)
    chains[collection_name] = chain
chains.keys()
dict_keys(['bge-m3_recursive_256_media', 'bge-m3_recursive_256_patent', 'bge-m3_recursive_1024_media', 'bge-m3_recursive_1024_patent', 'bge-m3_semantic_media', 'bge-m3_semantic_patent'])
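The keys above follow a naming pattern of the form embedding, chunking strategy, optional chunk size, and corpus, joined by underscores. The following is a minimal sketch of how such a name could be split into its parts for later grouping of results. `CollectionSpec` and `parse_collection_name` are hypothetical helpers that are not part of the notebook, and the sketch assumes that none of the parts contains an underscore itself (true for `bge-m3`, which uses a hyphen).

```python
from typing import NamedTuple, Optional


class CollectionSpec(NamedTuple):
    """Hypothetical container for the parts of a collection name."""
    embedding: str             # e.g. "bge-m3"
    strategy: str              # e.g. "recursive" or "semantic"
    chunk_size: Optional[int]  # e.g. 256; None when the name carries no size
    corpus: str                # "media" or "patent"


def parse_collection_name(name: str) -> CollectionSpec:
    """Split '<embedding>_<strategy>[_<chunk_size>]_<corpus>' into its parts."""
    parts = name.split("_")
    if len(parts) == 4:
        embedding, strategy, size, corpus = parts
        return CollectionSpec(embedding, strategy, int(size), corpus)
    if len(parts) == 3:
        embedding, strategy, corpus = parts
        return CollectionSpec(embedding, strategy, None, corpus)
    raise ValueError(f"Unexpected collection name: {name!r}")


names = ["bge-m3_recursive_256_media", "bge-m3_semantic_patent"]
specs = [parse_collection_name(n) for n in names]
```

With such a structure, evaluation results could later be grouped by `strategy`, `chunk_size`, or `corpus` instead of by the raw collection name.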
Evaluation¶
Because we have many hyperparameters to tune (chunk size, prompts, etc.) and different strategies to try, we will use the RAGAS (RAG Assessment) framework to evaluate our pipeline. RAGAS evaluates a RAG pipeline with an LLM as a judge, complemented by metrics that rely on embedding models. We will go into more detail on the metrics later on.
Before we can start the evaluation, we need to define the evaluation questions and their ground truth answers. For this we will use the provided evaluation questions. To enlarge our question pool, we will also generate additional question-answer pairs: the LLM (GPT-4o) receives a random chunk and generates a question and an answer from it.
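To make the data flow concrete, here is a minimal sketch of how one chain result (a dictionary with `context`, `question` and `answer`, as printed earlier) could be combined with a ground truth answer into a flat evaluation record. `StubDocument` and `to_eval_record` are hypothetical names, the example values are purely illustrative, and the record field names (`question`, `answer`, `contexts`, `ground_truth`) are an assumption that should be checked against the RAGAS version that is installed.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class StubDocument:
    """Hypothetical stand-in for a retrieved document; only page_content is used."""
    page_content: str


def to_eval_record(chain_output: Dict[str, Any], ground_truth: str) -> Dict[str, Any]:
    """Flatten one chain result into a record with plain-string contexts."""
    return {
        "question": chain_output["question"],
        "answer": chain_output["answer"],
        # Keep only the text of each retrieved chunk
        "contexts": [doc.page_content for doc in chain_output["context"]],
        "ground_truth": ground_truth,
    }


# Illustrative values only, shaped like the chain output shown earlier
example_output = {
    "context": [StubDocument("Four countries and the EU account for most capacity.")],
    "question": "Which organizations or countries are most active in cleantech technology?",
    "answer": "Four countries and the European Union.",
}
record = to_eval_record(example_output, ground_truth="(reference answer goes here)")
```

Keeping the contexts as plain strings separates the evaluation data from the retriever implementation, so the same records can be stored as a checkpoint and reloaded later.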
human_eval_df.head()
| | example_id | question | relevant_section | answer | url |
|---|---|---|---|---|---|
| 0 | 1 | What is the innovation behind Leclanché's new method to produce lithium-ion batteries? | Leclanché said it has developed an environmentally friendly way to produce lithium-ion (Li-ion) ... | Leclanché's innovation is using a water-based process instead of highly toxic organic solvents t... | sgvoice.energyvoice.com/strategy/technology/23971/leclanches-new-disruptive-battery-boosts-energ... |
| 1 | 2 | What is the EU’s Green Deal Industrial Plan? | The Green Deal Industrial Plan is a bid by the EU to make its net zero industry more competitive... | The EU’s Green Deal Industrial Plan aims to enhance the competitiveness of its net zero industry... | sgvoice.energyvoice.com/policy/25396/eu-seeks-competitive-boost-with-green-deal-industrial-plan |
| 2 | 3 | What is the EU’s Green Deal Industrial Plan? | The European counterpart to the US Inflation Reduction Act (IRA) aims to create an environment t... | The EU’s Green Deal Industrial Plan aims to enhance the competitiveness of its net zero industry... | pv-magazine.com/2023/02/02/european-commission-introduces-green-deal-industrial-plan |
| 3 | 4 | What are the four focus areas of the EU's Green Deal Industrial Plan? | The new plan is fundamentally focused on four areas, or pillars: the regulatory environment, acc... | The four focus areas of the EU's Green Deal Industrial Plan are the regulatory environment, acce... | sgvoice.energyvoice.com/policy/25396/eu-seeks-competitive-boost-with-green-deal-industrial-plan |
| 4 | 5 | When did the cooperation between GM and Honda on fuel cell vehicles start? | What caught our eye was a new hookup between GM and Honda. Honda was also hammering away at the ... | July 2013 | cleantechnica.com/2023/05/08/general-motors-seizes-the-fuel-cell-moment-with-green-hydrogen |
Our curated QA pairs for the media dataset:
human_eval_df_media.head()
| | id | relevant_section | question | answer | url | category | groundedness_score | groundedness_eval | relevance_score | relevance_eval | standalone_score | standalone_eval |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What is green hydrogen, and how is it produced? | Green hydrogen is a sustainable energy carrier produced by water electrolysis using renewable en... | azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 5.0 | The provided context clearly explains what green hydrogen is, how it is produced through water e... | 3.0 | The question is straightforward and clear, inquiring about a specific concept (green hydrogen) a... | 5.0 | The question is self-explanatory and does not rely on additional context to be understood. It is... |
| 1 | 1 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What is the significance of the Neom project in Saudi Arabia as a pioneering example of green hy... | The Neom project, a partnership between ACWA Power, Air Products, and NEOM, harnesses solar and ... | azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 5.0 | The context provides a comprehensive overview of the significance of green hydrogen and its inte... | 4.0 | The Neom project is a key concept in the context of green hydrogen integration, and understandin... | 4.0 | The question appears to rely on additional knowledge about the Neom project and its connection t... |
| 2 | 2 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What is the significance of the Energy Transitions Commission's report on making clean electrifi... | The report highlights the need for a 30-year transition to electrify the global economy, providi... | azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 5.0 | The question regarding the significance of the Energy Transitions Commission's report "Making Cl... | 3.0 | This question appears to be relevant to environmental sustainability and energy policy, which mi... | 5.0 | The question refers to a specific institution (Energy Transitions Commission) and a specific con... |
| 3 | 3 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | How does green hydrogen compare to direct use of electricity in terms of energy efficiency? | Green hydrogen production through electrolysis is less energy-efficient than direct use of elect... | azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 4.0 | The question requires an in-depth analysis of the context provided, specifically focusing on the... | 4.0 | This question is relevant to NLP developers building applications that may use energy-intensive ... | 5.0 | The question implies that there might be some universal or general information about green hydro... |
| 4 | 4 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What are some of the examples of pilot projects testing the viability of green hydrogen in vario... | Various pilot projects are testing green hydrogen's viability in energy systems, demonstrating i... | azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 5.0 | The question can be answered unambiguously based on the provided context, which describes variou... | 3.0 | This question appears to be focused on environmental sustainability and energy systems, which is... | 4.0 | The question asks for specific examples of pilot projects, which implies the existence of a cont... |
Our curated QA pairs for the patent dataset:
human_eval_df_patent.head()
| | id | relevant_section | question | answer | category | groundedness_score | groundedness_eval | relevance_score | relevance_eval | standalone_score | standalone_eval | title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Distributed photovoltaic energy storage refrigeration house systemThe utility model discloses a ... | How does the system reduce the cost of cold storage? | The system reduces the cost of cold storage by converting solar energy into electric energy, whi... | Sustainability & Technological Innovation Questions | 5.0 | The question can be answered based on the given context, and the answer is clear and unambiguous. | 4.0 | The question is concise and to the point, directly asking about a specific aspect of how the Hug... | 5.0 | The question appears to be related to a general concept of cost reduction in the context of data... | Distributed photovoltaic energy storage refrigeration house system |
| 1 | 1 | Water path manifold structure of hydrogen energy automobile electric pileThe utility model discl... | What is good about the utility model? | The utility model has a simple structure that is easy to assemble and disassemble. | Analytical & Explanatory Questions | 5.0 | The question "What is good about the utility model?" is somewhat ambiguous without further clari... | 3.0 | The question is very short and to the point, but it lacks context and detail about the specific ... | 5.0 | The question seems to be asking about a general property or characteristic of a "utility model",... | Water path manifold structure of hydrogen energy automobile electric pile |
| 2 | 2 | Active power control method of water-fire-wind-solar energy storage multi-energy complementary i... | What is the purpose of using power supply with better regulation performance to compensate for p... | The power supply with better regulation performance is used to carry out compensation regulation... | Government & Corporate Initiatives | 5.0 | The question is clearly answerable by understanding the purpose of using power supply with bette... | 3.0 | The question is directly related to power supply regulation and its impact on the performance of... | 5.0 | The question assumes knowledge of power supplies in general, specifically their regulation perfo... | Active power control method of water-fire-wind-solar energy storage multi-energy complementary i... |
| 3 | 3 | Water conservancy and hydropower engineering construction tunnel internal flow guiding and drain... | What is the water conservancy and hydropower engineering construction hole inner diversion drain... | The utility model discloses a water conservancy and hydropower engineering construction hole inn... | Sustainability & Technological Innovation Questions | 5.0 | The question is clearly answerable with the given context, as it describes a specific inner dive... | 3.0 | The question seems to be about a specific technical term, which may be useful for machine learni... | 5.0 | The question contains technical terms and a specific reference to a concept that appears to be w... | Water conservancy and hydropower engineering construction tunnel internal flow guiding and drain... |
| 4 | 4 | Medium-and-long-term electric power quantity balancing method for electric power system containi... | What is the main consideration for the balancing method, in addition to safety and economy? | The seasonal characteristics of renewable energy sources in time and the coordination problem of... | Analytical & Explanatory Questions | 4.0 | The context provides a detailed description of a method for balancing electric quantity in a pow... | 4.0 | The question is asking about a specific aspect of the balancing method, which is a common techni... | 5.0 | The question does not provide a specific context, and the balancing method is a general concept ... | Medium-and-long-term electric power quantity balancing method for electric power system containi... |
As we are only given the questions and the relevant sections of the articles, we need to generate the ground truth answers ourselves. We will use the LLM (GPT-4o) to answer each question from its relevant section and store the result in a ground_truth column.
import time
from openai import RateLimitError

def generate_eval_answers(df: pd.DataFrame) -> pd.DataFrame:
    answer_generation_prompt = """Answer the following question based on the article:
    Question: {question}
    Article: {article}
    """
    answer_generation_chain = ChatPromptTemplate.from_template(answer_generation_prompt) | llm
    for i, row in tqdm(df.iterrows(), total=len(df)):
        # Retry the same row until it succeeds or fails with a non-rate-limit error
        while True:
            try:
                response = answer_generation_chain.invoke({
                    "question": row["question"],
                    "article": row["relevant_section"]
                }).content
                df.at[i, "ground_truth"] = response
                break
            except RateLimitError:
                # Rate limits are transient: wait and try again
                print(f"Rate limit error at row {i}, retrying in 20 seconds...")
                time.sleep(20)
            except Exception as e:
                # Any other error is recorded and the row is skipped
                print(f"Other error at row {i}: {e}")
                df.at[i, "ground_truth"] = "ERROR"
                break
    return df
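The function above retries indefinitely with a fixed 20-second pause whenever a rate limit is hit. As an alternative, here is a minimal sketch of a bounded retry with exponential backoff. It is not the notebook's implementation: `call_with_backoff` is a hypothetical helper, and `FakeRateLimitError` is a stand-in for `openai.RateLimitError` so that the sketch runs without network access.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on the given exception types with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # wait 1x, 2x, 4x, ... base_delay
    raise ValueError("max_attempts must be at least 1")


class FakeRateLimitError(Exception):
    """Hypothetical stand-in for openai.RateLimitError."""


calls = {"count": 0}
delays = []  # records the requested waits instead of actually sleeping


def flaky() -> str:
    """Fail twice with a rate limit error, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeRateLimitError()
    return "ok"


result = call_with_backoff(flaky, (FakeRateLimitError,), sleep=delays.append)
```

In the notebook setting, `fn` could wrap the `answer_generation_chain.invoke(...)` call for one row and `retry_on` could be `(RateLimitError,)`. Bounding the attempts avoids a cell that never terminates when the quota is exhausted.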
We do not use the seed data here; only our curated evaluation data is used for the evaluation:
# if (silver_folder / "human_eval.csv").exists():
#     human_eval_df = pd.read_csv(silver_folder / "human_eval.csv")
# else:
#     human_eval_df = generate_eval_answers(human_eval_df)
#     human_eval_df.to_csv(silver_folder / "human_eval.csv", index=False)
# human_eval_df.head()
if (silver_folder / "human_eval_media.csv").exists():
    human_eval_df_media = pd.read_csv(silver_folder / "human_eval_media.csv")
else:
    human_eval_df_media = generate_eval_answers(human_eval_df_media)
    human_eval_df_media.to_csv(silver_folder / "human_eval_media.csv", index=False)
human_eval_df_media.head()
| | id | relevant_section | question | answer | url | category | groundedness_score | groundedness_eval | relevance_score | relevance_eval | standalone_score | standalone_eval | ground_truth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What is green hydrogen, and how is it produced? | Green hydrogen is a sustainable energy carrier produced by water electrolysis using renewable en... | azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 5.0 | The provided context clearly explains what green hydrogen is, how it is produced through water e... | 3.0 | The question is straightforward and clear, inquiring about a specific concept (green hydrogen) a... | 5.0 | The question is self-explanatory and does not rely on additional context to be understood. It is... | Green hydrogen is produced by water electrolysis using renewable energy sources. This process en... |
| 1 | 1 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What is the significance of the Neom project in Saudi Arabia as a pioneering example of green hy... | The Neom project, a partnership between ACWA Power, Air Products, and NEOM, harnesses solar and ... | azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 5.0 | The context provides a comprehensive overview of the significance of green hydrogen and its inte... | 4.0 | The Neom project is a key concept in the context of green hydrogen integration, and understandin... | 4.0 | The question appears to rely on additional knowledge about the Neom project and its connection t... | The significance of the Neom project in Saudi Arabia as a pioneering example of green hydrogen i... |
| 2 | 2 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What is the significance of the Energy Transitions Commission's report on making clean electrifi... | The report highlights the need for a 30-year transition to electrify the global economy, providi... | azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 5.0 | The question regarding the significance of the Energy Transitions Commission's report "Making Cl... | 3.0 | This question appears to be relevant to environmental sustainability and energy policy, which mi... | 5.0 | The question refers to a specific institution (Energy Transitions Commission) and a specific con... | The significance of the Energy Transitions Commission's report on making clean electrification p... |
| 3 | 3 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | How does green hydrogen compare to direct use of electricity in terms of energy efficiency? | Green hydrogen production through electrolysis is less energy-efficient than direct use of elect... | azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 4.0 | The question requires an in-depth analysis of the context provided, specifically focusing on the... | 4.0 | This question is relevant to NLP developers building applications that may use energy-intensive ... | 5.0 | The question implies that there might be some universal or general information about green hydro... | Based on the article, green hydrogen production through water electrolysis using renewable energ... |
| 4 | 4 | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | What are some of the examples of pilot projects testing the viability of green hydrogen in vario... | Various pilot projects are testing green hydrogen's viability in energy systems, demonstrating i... | azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions | 5.0 | The question can be answered unambiguously based on the provided context, which describes variou... | 3.0 | This question appears to be focused on environmental sustainability and energy systems, which is... | 4.0 | The question asks for specific examples of pilot projects, which implies the existence of a cont... | Some of the examples of pilot projects testing the viability of green hydrogen in various energy... |
if (silver_folder / "human_eval_patent.csv").exists():
    human_eval_df_patent = pd.read_csv(silver_folder / "human_eval_patent.csv")
else:
    human_eval_df_patent = generate_eval_answers(human_eval_df_patent)
    human_eval_df_patent.to_csv(silver_folder / "human_eval_patent.csv", index=False)
human_eval_df_patent.head()
| | id | relevant_section | question | answer | category | groundedness_score | groundedness_eval | relevance_score | relevance_eval | standalone_score | standalone_eval | title | ground_truth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Distributed photovoltaic energy storage refrigeration house systemThe utility model discloses a ... | How does the system reduce the cost of cold storage? | The system reduces the cost of cold storage by converting solar energy into electric energy, whi... | Sustainability & Technological Innovation Questions | 5.0 | The question can be answered based on the given context, and the answer is clear and unambiguous. | 4.0 | The question is concise and to the point, directly asking about a specific aspect of how the Hug... | 5.0 | The question appears to be related to a general concept of cost reduction in the context of data... | Distributed photovoltaic energy storage refrigeration house system | The system reduces the cost of cold storage by using distributed photovoltaic (solar) energy to ... |
| 1 | 1 | Water path manifold structure of hydrogen energy automobile electric pileThe utility model discl... | What is good about the utility model? | The utility model has a simple structure that is easy to assemble and disassemble. | Analytical & Explanatory Questions | 5.0 | The question "What is good about the utility model?" is somewhat ambiguous without further clari... | 3.0 | The question is very short and to the point, but it lacks context and detail about the specific ... | 5.0 | The question seems to be asking about a general property or characteristic of a "utility model",... | Water path manifold structure of hydrogen energy automobile electric pile | Based on the article, the good aspects (advantages) of the utility model are:\n\n- It effectivel... |
| 2 | 2 | Active power control method of water-fire-wind-solar energy storage multi-energy complementary i... | What is the purpose of using power supply with better regulation performance to compensate for p... | The power supply with better regulation performance is used to carry out compensation regulation... | Government & Corporate Initiatives | 5.0 | The question is clearly answerable by understanding the purpose of using power supply with bette... | 3.0 | The question is directly related to power supply regulation and its impact on the performance of... | 5.0 | The question assumes knowledge of power supplies in general, specifically their regulation perfo... | Active power control method of water-fire-wind-solar energy storage multi-energy complementary i... | According to the article, the purpose of using power supplies with better regulation performance... |
| 3 | 3 | Water conservancy and hydropower engineering construction tunnel internal flow guiding and drain... | What is the water conservancy and hydropower engineering construction hole inner diversion drain... | The utility model discloses a water conservancy and hydropower engineering construction hole inn... | Sustainability & Technological Innovation Questions | 5.0 | The question is clearly answerable with the given context, as it describes a specific inner dive... | 3.0 | The question seems to be about a specific technical term, which may be useful for machine learni... | 5.0 | The question contains technical terms and a specific reference to a concept that appears to be w... | Water conservancy and hydropower engineering construction tunnel internal flow guiding and drain... | The water conservancy and hydropower engineering construction hole inner diversion drainage stru... |
| 4 | 4 | Medium-and-long-term electric power quantity balancing method for electric power system containi... | What is the main consideration for the balancing method, in addition to safety and economy? | The seasonal characteristics of renewable energy sources in time and the coordination problem of... | Analytical & Explanatory Questions | 4.0 | The context provides a detailed description of a method for balancing electric quantity in a pow... | 4.0 | The question is asking about a specific aspect of the balancing method, which is a common techni... | 5.0 | The question does not provide a specific context, and the balancing method is a general concept ... | Medium-and-long-term electric power quantity balancing method for electric power system containi... | In addition to safety and economy, the main consideration for the balancing method is the seaso... |
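The two cells above share the same checkpoint pattern: load the CSV if it already exists, otherwise run the expensive generation and save the result. A minimal sketch of that pattern as a reusable helper follows. `load_or_compute` is a hypothetical function that is not part of the notebook, and the demonstration uses JSON in a temporary folder only to keep the sketch free of third-party dependencies.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_compute(
    path: Path,
    compute: Callable[[], Any],
    load: Callable[[Path], Any],
    save: Callable[[Any, Path], None],
) -> Any:
    """Return the checkpoint at `path` if present; otherwise compute and store it."""
    if path.exists():
        return load(path)
    result = compute()
    save(result, path)
    return result


compute_calls = []  # counts how often the expensive step actually runs


def expensive() -> dict:
    """Stand-in for an expensive step such as LLM answer generation."""
    compute_calls.append(1)
    return {"ground_truth": "generated once"}


def load_json(p: Path) -> Any:
    return json.loads(p.read_text())


def save_json(obj: Any, p: Path) -> None:
    p.write_text(json.dumps(obj))


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "checkpoint.json"
    first = load_or_compute(checkpoint, expensive, load_json, save_json)   # computes and saves
    second = load_or_compute(checkpoint, expensive, load_json, save_json)  # loads from disk
```

For the DataFrames used here, `load` could be `pd.read_csv` and `save` could be `lambda df, p: df.to_csv(p, index=False)`, which keeps the read and write options in one place.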
We will now generate synthetic question-answer pairs from random chunks: the LLM receives a randomly chosen chunk and is asked to generate a question and then an answer based on that chunk.
def generate_synthetic_qa_pairs(documents: List[Document], n: int = 10) -> pd.DataFrame:
    synthetic_questions = []
    # Sample chunk indices without replacement so that no chunk is used twice
    indices = np.random.choice(len(documents), size=min(n, len(documents)), replace=False)
    documents = [documents[i] for i in indices]
    question_generation_prompt = """Generate a short and general question based on the following news article:
    Article: {article}
    """
    question_generation_chain = ChatPromptTemplate.from_template(question_generation_prompt) | llm
    answer_generation_prompt = """Answer the following question based on the article:
    Question: {question}
    Article: {article}
    """
    answer_generation_chain = ChatPromptTemplate.from_template(answer_generation_prompt) | llm
    for document in tqdm(documents):
        element = {}
        content = document.page_content
        element["relevant_section"] = content
        # Patent chunks have no URL in their metadata, so fall back to None
        element["url"] = document.metadata.get("url") or None
        # First generate a question from the chunk, then answer it from the same chunk
        question = question_generation_chain.invoke({"article": content}).content
        element["question"] = question
        answer = answer_generation_chain.invoke({"question": question, "article": content}).content
        element["ground_truth"] = answer
        synthetic_questions.append(element)
    return pd.DataFrame(synthetic_questions)
if not (silver_folder / "synthetic_media_eval.csv").exists():
    synthetic_eval_df_media = generate_synthetic_qa_pairs(chunks["recursive_1024_media"], 25)
    synthetic_eval_df_media.to_csv(silver_folder / "synthetic_media_eval.csv", index=False)
else:
    # The CSV is written without an index, so it is read back without index_col;
    # otherwise relevant_section would become the index instead of a regular column
    synthetic_eval_df_media = pd.read_csv(silver_folder / "synthetic_media_eval.csv")
synthetic_eval_df_media.head()
| relevant_section | url | question | ground_truth |
|---|---|---|---|
| and managed, along with suggestions for local projects that could be supported. Mike Rutgers, development director at Low Carbon said: “ This is a major milestone for Low Carbon as we approach our planning submission. We’ ve taken into consideration the feedback provided by the local community earlier this year which, alongside our ongoing technical and environmental surveys, has helped us to refine our proposals to those you see today. “ The community asked us to work with other developers in the area to reduce cumulative impacts. You’ ll see from our new plans that we’ re proposing to do just that. By seeking to align our Grid Connect Route with other proposals in the area, we hope to pursue the most efficient way of working and minimise any adverse impacts on the community. ” The Gate Burton Energy Park project website has been updated to include the refined proposals for the site, which sits in the West Lindsey District near Gate Burton, Knaith Park and Willingham-by-Stow. Additionally, more than 7,000 | solarpowerportal.co.uk/low_carbon_launches_second_consultation_for_500mw_gate_burton_solar_and_sto | What are the key updates and community considerations in the latest proposals for the Gate Burto... | The key updates and community considerations in the latest proposals for the Gate Burton Energy ... |
| heat production in much of the world. '' So while solar power remains marginal in the global energy mix, we are forced to see that it is growing exponentially and worldwide. Indeed, solar energy seems to be the great energy of the future due to its multiple... China’ s National Energy Administration ( NEA) announced a total PV installation target of 18.1GW for 2016, within which 12.6 GW are for standard types of PV installations, including both ground-mounted PV and distributed PV... Developer: Sol Source Partners Inc. Location: Shelburne, Ont. Centre Dufferin District High School Size: 50 kW ( 20 kW facade, 30 kW flat roof) HELIENE Modules: 200 – HELIENE 60M 250Wp Modules The Upper Grand District School Board is committed to promoting and teaching environmental stewardship and the Centre Dufferin District High School in Shelburne is a leader in... By Heliene Inc. Developer: Carmanah Technologies ( EPC Provider) for Powerstream Project Location: Markham, Ontario Size: 350 kW HELIENE Modules: 1167 – HELIENE | energy-xprt.com/solar-energy/photovoltaic-technology/articles | How is solar power growing and being implemented worldwide? | Solar power is experiencing exponential growth worldwide, despite currently being a marginal par... |
| The new cutting-edge technology making generators greener: The latest innovation in generator technology, the flywheel power system, is tipped to revolutionise the industry, making systems greener, cleaner and more cost effective. The flywheel power system is a piece of next generation kit that works by capturing energy that is normally wasted during a machine or vehicles use and storing it in a high-speed energy storage flywheel. The stored energy can then be cycled back through the machine and used in its running, saving fuel, improving performance and reducing emissions. Genny Hire, a diesel-generator rental company who serve the North East of Scotland, is excited to have put in an order with Silverstone-based PUNCH Flybrid for their flywheel power system. As to the thought process behind the deal, Lorna from Genny Hire said: “ My main reason for making an ongoing investment in this technology is that both our customers and we ourselves want to reduce our energy consumption. “ We always want to provide the | energyvoice.com/promoted/469987/the-new-cutting-edge-technology-making-generators-greener | How is the new flywheel power system expected to make generators greener and more efficient? | The new flywheel power system is expected to make generators greener and more efficient by captu... |
| including how DOE’ s investments can be most impactful in promoting workforce development, and environmental and energy justice through the EGS Pilot Demonstrations Program. These demonstration projects could help advance DOE’ s goals to deploy more than 60 gigawatts ( GW) of geothermal electricity-generating capacity by 2050—resulting in clean, reliable power for 129 million American homes and businesses, and contributing to President Biden’ s goals for a net-zero emissions economy. Source: Department of Energy The U.S. Department of Energy has issued a Request for Information ( RFI) for enhanced geothermal systems ( EGS) pilot demonstration projects. This RFI is part of the execution of President Biden’ s Bipartisan Infrastructure Law that authorizes the DOE to support four selected pilot projects with USD 84 million. Responses to the RFI must be submitted via email to BIL EGSPilotDemos @ ee.doe.gov by 13 May 2022, 5pm ET. EGS projects can significantly increase geothermal energy deployment throughout the | thinkgeoenergy.com/doe-offers-usd-84-million-funding-for-egs-projects | How is the Department of Energy promoting workforce development and environmental justice throug... | The Department of Energy (DOE) is promoting workforce development and environmental justice thro... |
| to help reduce its burden on the planet. This International Women's Day, AZoCleantech spoke to inspiring women who have made a difference in the Clean Technology field. For this interview, we spoke to Dr. Adina Rom, the Executive Director of ETH for Development. | azocleantech.com/suppliers.aspx?SupplierID=1309 | How are women making an impact in the field of Clean Technology? | The article highlights that women are making an impact in the field of Clean Technology by takin... |
if not (silver_folder / "synthetic_patent_eval.csv").exists():
    synthetic_eval_df_patent = generate_synthetic_qa_pairs(chunks["recursive_1024_patent"], 25)
    synthetic_eval_df_patent.to_csv(silver_folder / "synthetic_patent_eval.csv", index=False)
else:
    synthetic_eval_df_patent = pd.read_csv(silver_folder / "synthetic_patent_eval.csv", index_col=0)
synthetic_eval_df_patent.head()
| relevant_section | url | question | ground_truth |
|---|---|---|---|
| Hub type wind power generation tower with vertical hollow shaft sleeve as hub: The invention discloses a hub type wind power generation tower with a vertical hollow shaft sleeve, which comprises a tower barrel main body, an engine room, a vertical hollow shaft, a hub, a paddle main body, a transmission mechanism and a generator, wherein the vertical hollow shaft sleeve is used as a hub; the vertical hollow shaft is arranged at the top end of the engine room, and a wheel dice is sleeved on the vertical hollow shaft and is connected with the vertical hollow shaft through a bearing; the paddle main body is installed on the outer side wall of the hub, the transmission mechanism is connected with the hub and the generator, the generator is installed in the engine room, the hub drives the transmission mechanism to rotate, the generator is further driven to rotate, and alternating current is output. The power generation tower adopts a mode of combining the vertical hollow shaft and the hub, has enough supporting | NaN | What are the key components and functions of the hub type wind power generation tower with a ver... | The key components and functions of the hub type wind power generation tower with a vertical hol... |
| High-permeability photovoltaic access distribution network architecture and protection configuration method: The invention discloses a high-permeability photovoltaic access distribution network architecture and protection configuration method, which comprises the following steps: step 1, analyzing a typical architecture mode of a high-permeability power distribution network and power generation characteristic modeling of a photovoltaic power supply, and analyzing a fault mechanism and fault characteristics of the high-permeability photovoltaic power distribution network; step 2, analyzing the mode of photovoltaic access to a 380V distribution network alternating current bus, and generally selecting a two-stage selection two-stage access mode; step 3, modeling photovoltaic power generation characteristics by aiming at a direct power generation mode that the photovoltaic cell panel converts solar energy into electric energy; step 4, analyzing a fault mechanism when the AC/DC port line has a fault, and | NaN | What are the key features and benefits of the high-permeability photovoltaic access distribution... | The key features and benefits of the high-permeability photovoltaic access distribution network ... |
| Hybrid solar thermal and chemical vehicle configurations for space mining applications: Solar thermal and chemical hybrid rocket configurations for mining and other space applications are disclosed. One aspect is a rocket propulsion system configured to provide rocket thrust, including a solar absorber, a rocket nozzle, and a solar power collection system configured to collect solar energy from the sun, generate an energy beam from the collected sunlight, heat the solar absorber to transfer heat to one or more pressurized propulsive gases, and expel the heated pressurized propulsive gases through a rocket nozzle. A solar absorber can be formed from a granular collection or agglomeration of solids (e.g., of beads), which can be layered with more transparent layer(s) above and more absorbing layer(s) below to create a temperature profile in propellant(s) flowing through the absorber. A hybrid motor can provide an energy (e.g., solar) absorber for absorbing and transferring radiative energy as well as a | NaN | What are the advantages of using hybrid solar thermal and chemical rocket systems for space mini... | The advantages of using hybrid solar thermal and chemical rocket systems for space mining applic... |
| is through rotating the lead screw, and the ejector pad is under the effect that sets up the groove, and lead screw drive ejector pad drives the ejector pad and drives and place the board and rise, and drive rack removes, places the restriction to flat solar energy removal and release the frame with placing, easy operation, the dismouting of the flat solar energy of being convenient for improves solar energy dismouting efficiency. | NaN | How does the described mechanism improve the efficiency of removing flat solar panels? | The described mechanism improves the efficiency of removing flat solar panels by using a lead sc... |
| frame in a sliding manner, one end of the threaded rod is fixedly connected with a moving plate, inserting plates are fixedly connected to side walls of both ends of the moving plate, and inserting grooves corresponding to the inserting plates are formed in one side of the assembling blocks. | NaN | What is the function of the threaded rod and moving plate in the described assembly? | The function of the threaded rod and moving plate in the described assembly is to enable control... |
question_length = {
    "human": human_eval_df_media["question"].map(len),
    "synthetic": synthetic_eval_df_media["question"].map(len)
}
sns.histplot(question_length, kde=True)
plt.title("Question Length Distribution")
plt.xlabel("Question Length")
plt.ylabel("Count")
plt.show()
eval_df_media = pd.concat([human_eval_df_media, synthetic_eval_df_media], ignore_index=True)
eval_df_media["is_synthetic"] = eval_df_media["relevant_section"].isna()
eval_df_media["is_synthetic"].value_counts()
| is_synthetic | count |
|---|---|
| False | 200 |
| True | 25 |
question_length = {
    "human": human_eval_df_patent["question"].map(len),
    "synthetic": synthetic_eval_df_patent["question"].map(len)
}
sns.histplot(question_length, kde=True)
plt.title("Question Length Distribution")
plt.xlabel("Question Length")
plt.ylabel("Count")
plt.show()
eval_df_patent = pd.concat([human_eval_df_patent, synthetic_eval_df_patent], ignore_index=True)
eval_df_patent["is_synthetic"] = eval_df_patent["relevant_section"].isna()
eval_df_patent["is_synthetic"].value_counts()
| is_synthetic | count |
|---|---|
| False | 200 |
| True | 25 |
Each evaluation set now combines the provided questions with 25 synthetic ones (200 and 25 rows respectively, according to the counts above). However, the histograms show that our synthetic questions are slightly longer than the provided questions, which could mean that they are slightly easier to answer. This potential bias should be taken into account when evaluating the pipeline.
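To put a number on the length difference rather than reading it off the histogram, one could compare summary statistics per group. The following is a minimal sketch in plain Python; the helper `length_summary` and the sample questions are hypothetical and only illustrate the idea, they are not part of the notebook's data.

```python
from statistics import mean, median


def length_summary(questions):
    """Return the mean and median character length of a list of questions."""
    lengths = [len(q) for q in questions]
    return {"mean": mean(lengths), "median": median(lengths)}


# Hypothetical sample questions, for illustration only
human_questions = ["Who built the plant?", "What is EGS?"]
synthetic_questions = [
    "How is the flywheel power system expected to make generators greener?",
    "What are the key components of the hub type wind power generation tower?",
]

summary = {
    "human": length_summary(human_questions),
    "synthetic": length_summary(synthetic_questions),
}
print(summary)
```

In the notebook itself the same comparison could be run on the `question` columns of the human and synthetic DataFrames.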
RAGAS Metrics¶
RAGAS provides a variety of metrics to evaluate the performance of a RAG pipeline. Here are some of the key metrics we will be using and how they are calculated:
- Answer Relevancy: This metric measures the relevance of the generated answer to the user query. Answer Relevancy is defined as the mean cosine similarity between the original question and a number of artificial questions that were generated (reverse-engineered) from the answer.
- Answer Correctness: This metric measures the correctness of the generated answer. Answer correctness encompasses two critical aspects: semantic similarity between the generated answer and the ground truth, as well as factual similarity. These aspects are combined using a weighted scheme to formulate the answer correctness score.
- Faithfulness: This metric measures how faithful the generated answer is to the retrieved chunks. The generated answer is regarded as faithful if all the claims made in the answer can be inferred from the given context. To calculate this, a set of claims is first identified from the generated answer. Then each of these claims is cross-checked against the given context to determine whether it can be inferred from that context.
- Context Relevancy: This metric measures the relevance of the retrieved chunks to the user query. Ideally, the retrieved context should exclusively contain essential information to address the provided query. To compute this, we first estimate the number of sentences within the retrieved context that are relevant for answering the given question and divide it by the total number of sentences in the retrieved context.
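The arithmetic behind these definitions can be sketched in a few lines of plain Python. This is only an illustration of the formulas described above, not the RAGAS implementation: in RAGAS the embeddings, the regenerated questions, the claim verdicts and the relevant-sentence counts all come from model calls, whereas here they are passed in as hypothetical inputs. All function names are made up for this sketch, and the default `weights` are an assumed, illustrative choice.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevancy(question_vec, generated_question_vecs):
    """Mean cosine similarity between the original question embedding and
    the embeddings of the questions regenerated from the answer."""
    sims = [cosine_similarity(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


def faithfulness(claim_supported):
    """Share of answer claims that can be inferred from the context
    (one boolean verdict per claim)."""
    return sum(claim_supported) / len(claim_supported)


def context_relevancy(relevant_sentences, total_sentences):
    """Relevant sentences divided by all sentences in the retrieved context."""
    return relevant_sentences / total_sentences


def answer_correctness(factual_score, semantic_score, weights=(0.75, 0.25)):
    """Weighted combination of factual and semantic similarity.
    The default weights are an assumption made for this sketch."""
    w_factual, w_semantic = weights
    total = w_factual + w_semantic
    return (w_factual * factual_score + w_semantic * semantic_score) / total


# Toy inputs, for illustration only
print(answer_relevancy([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
print(faithfulness([True, True, False]))
print(context_relevancy(2, 8))
print(answer_correctness(0.5, 1.0))
```

A faithfulness of 1.0 in this sketch corresponds to the case described above where every claim in the answer is supported by the retrieved context.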

For this to work we create a test dataset for each of our RAG pipelines that contains the evaluation questions and their ground truth answers. We then run all the questions through our RAG pipeline and store the generated answers and the retrieved chunks. We can then use this test dataset to calculate the RAGAS metrics.
from datasets import Dataset
import json
from tqdm import tqdm

datasets_folder = gold_folder / "datasets"
datasets_folder.mkdir(exist_ok=True)

# Recursive cleaning for PyArrow compatibility
def clean_for_arrow(value):
    if isinstance(value, list):
        return [clean_for_arrow(v) for v in value]
    if value is None or isinstance(value, float):
        return ""
    return str(value)

def convert_to_strings(datapoints):
    return {key: clean_for_arrow(val) for key, val in datapoints.items()}

def get_or_create_eval_dataset(name: str, df: pd.DataFrame, chain) -> Dataset:
    dataset_file = datasets_folder / f"{name}_dataset.json"
    if dataset_file.exists():
        with open(dataset_file, "r") as file:
            raw_data = json.load(file)
        cleaned_data = convert_to_strings(raw_data)
        dataset = Dataset.from_dict(cleaned_data)
        print(f"Loaded {name} dataset from {dataset_file}")
    else:
        datapoints = {
            "question": df["question"].tolist(),
            "answer": [],
            "contexts": [],
            "ground_truth": df["ground_truth"].tolist(),
            "context_urls": [],
            "category": df["category"].tolist() if "category" in df.columns else ["" for _ in df.index]
        }
        for question in tqdm(datapoints["question"], desc=f"Generating {name}"):
            result = chain.invoke(question)
            datapoints["answer"].append(result.get("answer", ""))
            datapoints["contexts"].append([doc.page_content for doc in result.get("context", [])])
            datapoints["context_urls"].append([doc.metadata.get("url", "") for doc in result.get("context", [])])
        # Clean data before writing to JSON and converting to Dataset
        cleaned_data = convert_to_strings(datapoints)
        with open(dataset_file, "w") as file:
            json.dump(cleaned_data, file)
        dataset = Dataset.from_dict(cleaned_data)
        print(f"Saved {name} dataset to {dataset_file}")
    return dataset
def plot_llm_eval(name: str, eval_results: pd.DataFrame):
    # select only the float64 columns (assuming these are the RAGAS metrics)
    ragas_metrics_data = (eval_results
                          .select_dtypes(include=[np.float64]))
    # boxplot of distributions
    sns.boxplot(data=ragas_metrics_data, palette="Set2")
    plt.title(f'{name}: Distribution of RAGAS Evaluation Metrics')
    plt.ylabel('Scores')
    plt.xlabel('Metrics')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    # barplot of means
    means = ragas_metrics_data.mean()
    plt.figure(figsize=(14, 8))
    sns.barplot(x=means.index, y=means, palette="Set2")
    plt.title(f'{name}: Mean of RAGAS Evaluation Metrics')
    plt.ylabel('Mean Scores')
    plt.xlabel('Metrics')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
def plot_multiple_evals(eval_results: Dict[str, pd.DataFrame]):
    # combine the results
    full_results = []
    for name, results in eval_results.items():
        results['name'] = name
        full_results.append(results)
    full_results = pd.concat(full_results, ignore_index=True)
    full_results = full_results.sort_values(by='name')
    # select only the float64 columns (assuming these are the RAGAS metrics)
    ragas_metrics_data = full_results.select_dtypes(include=[np.float64])
    ragas_metrics_data['name'] = full_results['name']
    # boxplot of distributions
    plt.figure(figsize=(14, 8))
    sns.boxplot(x='variable', y='value', hue='name', data=pd.melt(ragas_metrics_data, id_vars='name'), palette="Set2")
    plt.title('Distribution of RAGAS Evaluation Metrics by Model')
    plt.ylabel('Scores')
    plt.xlabel('Metrics')
    plt.xticks(rotation=45)
    plt.legend(title='Model')
    plt.tight_layout()
    plt.show()
    # barplot of means
    means = ragas_metrics_data.groupby('name').mean().reset_index()
    means_melted = pd.melt(means, id_vars='name')
    plt.figure(figsize=(14, 8))
    sns.barplot(x='variable', y='value', hue='name', data=means_melted, palette="Set2")
    plt.title('Mean of RAGAS Evaluation Metrics by Model')
    plt.ylabel('Mean Scores')
    plt.xlabel('Metrics')
    plt.xticks(rotation=45)
    plt.legend(title='Model')
    plt.tight_layout()
    plt.show()
def plot_multiple_evals_by_category(eval_results: Dict[str, pd.DataFrame]):
    # Combine all results into one DataFrame
    full_results = pd.concat(eval_results.values(), ignore_index=True)
    # Drop rows without a valid 'category'
    if "category" not in full_results.columns:
        print("No 'category' column found in the provided data.")
        return
    full_results = full_results.dropna(subset=["category"])
    full_results = full_results[full_results["category"].astype(str).str.strip() != ""]
    if full_results.empty:
        print("All rows are missing 'category'. Nothing to plot.")
        return
    # Select RAGAS metric columns (float64) and keep category
    ragas_metrics_data = full_results.select_dtypes(include=[np.float64])
    ragas_metrics_data["category"] = full_results["category"]
    # Boxplot of distributions by category
    plt.figure(figsize=(14, 8))
    sns.boxplot(
        x='variable', y='value', hue='category',
        data=pd.melt(ragas_metrics_data, id_vars='category'),
        palette="Set2"
    )
    plt.title('Distribution of RAGAS Evaluation Metrics by Category')
    plt.ylabel('Scores')
    plt.xlabel('Metrics')
    plt.xticks(rotation=45)
    plt.legend(title='Category')
    plt.tight_layout()
    plt.show()
    # Barplot of means by category
    means = ragas_metrics_data.groupby('category').mean().reset_index()
    means_melted = pd.melt(means, id_vars='category')
    plt.figure(figsize=(14, 8))
    sns.barplot(
        x='variable', y='value', hue='category',
        data=means_melted, palette="Set2"
    )
    plt.title('Mean of RAGAS Evaluation Metrics by Category')
    plt.ylabel('Mean Scores')
    plt.xlabel('Metrics')
    plt.xticks(rotation=45)
    plt.legend(title='Category')
    plt.tight_layout()
    plt.show()
# selected_dataset = get_or_create_eval_dataset("selected", eval_df, selected_chain)
selected_dataset_media = get_or_create_eval_dataset("selected_media", eval_df_media, selected_chain_media)
Loaded selected_media dataset from data/data_new/gold/datasets/selected_media_dataset.json
selected_dataset_patent = get_or_create_eval_dataset("selected_patent", eval_df_patent, selected_chain_patent)
Loaded selected_patent dataset from data/data_new/gold/datasets/selected_patent_dataset.json
As a judge we use the GPT-4o-mini model, a smaller version of GPT-4o. While it is not as powerful as the full GPT-4o model, it is still very capable and lets us evaluate the performance of our RAG pipeline without incurring high costs.
The literature also suggests that, when LLMs are evaluated with LLMs as judges, the evaluation is more reliable when the judge is a different model from the one being evaluated. The reason given is that a model might have learned to exploit the weaknesses of the other model, or might be biased towards its own answers. https://arxiv.org/abs/2404.13076
def evaluate_sample_questions_for_df(eval_df):
    judge = ChatOpenAI(model="gpt-4o")
    question_prompt = ChatPromptTemplate.from_template(
        "Answer the following question: {question}")
    question_chain = question_prompt | judge | StrOutputParser()
    # Sample questions from eval_df
    sample_questions = eval_df["question"].sample(5).tolist()
    for question in sample_questions:
        try:
            response = question_chain.invoke({"question": question})
            print(f"Question: {question}\nResponse: {response}\n")
        except Exception as e:
            print(f"Error processing question: {question}\nError: {e}\n")

evaluate_sample_questions_for_df(eval_df_media)
Question: Who awarded the largest-ever offshore wind auction in Germany?
Response: Germany's largest-ever offshore wind auction was awarded by the Federal Network Agency (Bundesnetzagentur). This auction marked a significant step in the country's efforts to expand its renewable energy capacity.

Question: What is the significance of Tesla's Gigafactory Texas?
Response: Tesla's Gigafactory Texas, also known as Giga Texas or Gigafactory Austin, holds substantial significance for several reasons:
1. **Strategic Location**: Situated near Austin, Texas, this factory is centrally located, which helps Tesla optimize logistics and distribution across the United States. This location is advantageous for shipping vehicles to both the East and West coasts and accessing international markets via Gulf ports.
2. **Increased Production Capacity**: Giga Texas is a key component of Tesla's strategy to increase its production capacity. It is expected to produce a large volume of vehicles annually, including the Model Y and the highly anticipated Cybertruck. This expansion is crucial for meeting growing demand and achieving Tesla's long-term production goals.
3. **Innovation and Technology**: The factory is designed to incorporate advanced manufacturing processes and technologies. Tesla aims to make it one of the most efficient automotive plants in the world, which could lower production costs and improve vehicle quality.
4. **Economic Impact**: The construction and operation of Gigafactory Texas have significant economic implications for the region, including job creation and investment in local infrastructure. The factory is expected to create thousands of direct jobs and support ancillary industries and services.
5. **Sustainability Efforts**: Tesla emphasizes sustainability in its operations, and Gigafactory Texas is no exception. The facility is designed to be environmentally friendly, with a focus on energy efficiency and the use of renewable energy sources, aligning with Tesla's mission to accelerate the world's transition to sustainable energy.
6. **Expansion of Product Line**: With Giga Texas, Tesla has the capacity to expand its product line, potentially introducing new models or variations of existing models. This expansion supports Tesla's efforts to cater to a broader market and diversify its offerings.
Overall, Gigafactory Texas is a critical component of Tesla's growth strategy, enabling the company to enhance its manufacturing capabilities, explore new technologies, and maintain its competitive edge in the rapidly evolving automotive industry.

Question: Who is already exploring the use of EV batteries for backup generation?
Response: Several companies and organizations are exploring the use of electric vehicle (EV) batteries for backup generation and grid support. Some of the notable ones include:
1. **Automakers**: Companies like Nissan, Tesla, and Ford are actively exploring and implementing vehicle-to-grid (V2G) technology, which allows EVs to feed electricity back into the grid or power homes during outages.
2. **Utilities**: Various utility companies around the world are partnering with automakers and technology firms to test and deploy V2G solutions. These include companies in Europe, North America, and Asia that are interested in stabilizing the grid and enhancing energy storage capabilities.
3. **Technology Firms**: Companies like Siemens and ABB are involved in developing the infrastructure and technology needed to support V2G systems and integrate them with existing grid operations.
4. **Government and Research Institutions**: Various government agencies and research institutions are funding and conducting studies to understand the potential benefits and challenges of using EV batteries for backup generation and grid stability.
These efforts are part of a broader push towards more sustainable and resilient energy systems, leveraging the growing number of EVs on the road.

Question: What is the name of the Isle of Man's Innovation Challenge, which is fostering cleantech, AI, and FinTech solutions for sustainability and net-zero goals?
Response: The Isle of Man's Innovation Challenge is known as the "Island Innovator Challenge." This initiative focuses on fostering cleantech, AI, and FinTech solutions to support sustainability and net-zero goals.

Question: Who is the author of the Mercom report on India's energy storage landscape?
Response: The Mercom report on India's energy storage landscape is published by Mercom Capital Group, a global clean energy research and communications firm. The specific author of the report is typically not highlighted, as it is produced by their research team. If you need the most accurate and detailed information, it's best to refer directly to the report or contact Mercom Capital Group.
evaluate_sample_questions_for_df(eval_df_patent)
Question: What problem does the power control method aim to solve?
Response: The power control method aims to solve the problem of managing and optimizing the transmission power levels of devices in wireless communication networks. The primary goals are to:
1. **Minimize Interference:** By adjusting the power levels, the method reduces interference between devices, which is crucial for maintaining the quality of communication in densely populated networks.
2. **Improve Signal Quality:** Proper power control ensures that the signal-to-noise ratio (SNR) is maintained at an optimal level, improving the overall quality and reliability of the communication link.
3. **Extend Battery Life:** In battery-powered devices, power control helps conserve energy by using only the necessary amount of power to maintain a connection, thereby extending the device's operational life.
4. **Maximize Network Capacity:** Efficient power control allows more users to share the same frequency spectrum without degrading service quality, effectively increasing the network's capacity.
5. **Facilitate Fair Resource Allocation:** By managing power levels, the method helps in providing fair access to network resources for all users, preventing scenarios where users close to the base station overpower those further away.
Overall, power control is a key technique in wireless communications for enhancing network performance, efficiency, and user experience.

Question: How does the aeration in the good oxygen pond ensure uniform distribution of sewage?
Response: Aeration in a pond, often referred to as a lagoon or aerated lagoon in wastewater treatment, plays a crucial role in ensuring the uniform distribution of sewage. Here's how it achieves that:
1. **Mixing Action**: Aeration involves the introduction of air into the pond, typically through diffusers or mechanical aerators. This process creates turbulence in the water, facilitating the mixing of sewage throughout the pond. The continuous circulation helps prevent the settling of solids at the bottom and promotes even distribution of organic material.
2. **Oxygen Supply**: The aeration process increases the dissolved oxygen levels in the pond. Oxygen is essential for the aerobic microorganisms that break down organic matter in the sewage. By maintaining adequate oxygen levels throughout the pond, aeration ensures that these microorganisms are evenly distributed and active, promoting uniform treatment.
3. **Prevention of Stratification**: Without aeration, ponds can become stratified, with layers of different temperatures and compositions. Aeration helps to break down these layers, ensuring that the entire pond maintains a consistent environment conducive to wastewater treatment. This prevents the formation of anaerobic zones that could lead to uneven treatment and foul odors.
4. **Enhanced Biological Activity**: Aeration encourages the growth and activity of aerobic bacteria, which are more efficient in breaking down organic pollutants than anaerobic bacteria. This biological activity is spread throughout the pond, ensuring that all areas participate in the treatment process.
Overall, aeration is essential for maintaining the homogeneity of the pond's contents, optimizing the conditions for biological treatment, and ensuring efficient and uniform processing of sewage.

Question: What is the purpose of growing Prussian blue on the titanium dioxide-based silicon dioxide fiber film?
Response: Growing Prussian blue on titanium dioxide-based silicon dioxide fiber film can serve several purposes, largely depending on the intended application:
1. **Catalysis**: Prussian blue can act as a catalyst or catalyst support. When grown on titanium dioxide (TiO2), which is also a known photocatalyst, the combination can enhance the catalytic properties, making it useful for applications such as environmental remediation or chemical synthesis.
2. **Sensing**: Prussian blue is known for its electrochromic properties, meaning it can change color when an electric charge is applied. When integrated with a TiO2-based silicon dioxide fiber film, it can be used in sensors, particularly those that detect gases or changes in environmental conditions.
3. **Energy Storage**: The combination of Prussian blue and TiO2 can be used in energy storage devices. Prussian blue has been explored as a material for battery electrodes, including sodium-ion and lithium-ion batteries, due to its ability to undergo reversible redox reactions.
4. **Photovoltaics**: TiO2 is widely used in photovoltaic devices, and Prussian blue can enhance the light absorption and electron transport properties of these devices, potentially leading to more efficient solar cells.
5. **Biomedical Applications**: Prussian blue has applications in biomedicine, including drug delivery and imaging. The biocompatibility and surface chemistry of TiO2-based silicon dioxide fibers can be utilized to functionalize Prussian blue for such applications.
Overall, the purpose of growing Prussian blue on titanium dioxide-based silicon dioxide fiber film is to leverage the synergistic properties of these materials to enhance their performance in specific applications, such as catalysis, sensing, energy storage, photovoltaics, or biomedicine.

Question: How does the device facilitate the transportation of the double ladder body?
Response: To provide a specific answer, I would need more information about the device in question and the type of double ladder body being referred to. However, generally speaking, devices that facilitate the transportation of ladders often include features such as:
1. **Wheels or Casters**: These allow for easy rolling of the ladder across surfaces, reducing the need to lift and carry it.
2. **Foldability or Collapsibility**: This feature allows the ladder to be compacted into a smaller size, making it easier to transport and store.
3. **Carrying Handles**: Strategically placed handles can make it easier to grab and move the ladder.
4. **Lightweight Materials**: Using materials like aluminum or fiberglass can reduce the overall weight of the ladder, making it easier to carry.
5. **Straps or Clamps**: These might be used to secure the ladder during transportation, especially if it is being transported on a vehicle.
6. **Balance and Stability Features**: These can help in maintaining the ladder’s balance during transit, preventing tipping or falling.
If you have a specific device or ladder model in mind, please provide more details for a more precise explanation.

Question: What is a benefit of using this device?
Response: To provide a specific benefit, I would need to know which device you are referring to. Could you please provide more details or specify the device in question?
Evaluating the evaluation datasets¶
# selected_llm_eval_results = get_or_run_llm_eval("selected", selected_dataset, llm)
# plot_llm_eval("selected", selected_llm_eval_results)
selected_media_llm_eval_results = get_or_run_llm_eval("selected_media", selected_dataset_media, llm)
plot_llm_eval("selected_media", selected_media_llm_eval_results)
Loaded selected_media evaluation results from data/data_new/gold/results/selected_media_llm_eval_results.csv
selected_patent_llm_eval_results = get_or_run_llm_eval("selected_patent", selected_dataset_patent, llm)
plot_llm_eval("selected_patent", selected_patent_llm_eval_results)
Loaded selected_patent evaluation results from data/data_new/gold/results/selected_patent_llm_eval_results.csv
Here we can see that the evaluation metrics differ considerably between the patent and media datasets, although the mean scores stay below 0.6 for both. Faithfulness, answer relevancy, context relevancy and answer correctness all take different values across the two datasets, which shows how varied our data is.
datasets = {}
for name, chain in chains.items():
    if "media" in name:
        datasets[name] = get_or_create_eval_dataset(name, eval_df_media, chain)
    elif "patent" in name:
        datasets[name] = get_or_create_eval_dataset(name, eval_df_patent, chain)
Loaded bge-m3_recursive_256_media dataset from data/data_new/gold/datasets/bge-m3_recursive_256_media_dataset.json
Loaded bge-m3_recursive_256_patent dataset from data/data_new/gold/datasets/bge-m3_recursive_256_patent_dataset.json
Loaded bge-m3_recursive_1024_media dataset from data/data_new/gold/datasets/bge-m3_recursive_1024_media_dataset.json
Loaded bge-m3_recursive_1024_patent dataset from data/data_new/gold/datasets/bge-m3_recursive_1024_patent_dataset.json
Loaded bge-m3_semantic_media dataset from data/data_new/gold/datasets/bge-m3_semantic_media_dataset.json
Loaded bge-m3_semantic_patent dataset from data/data_new/gold/datasets/bge-m3_semantic_patent_dataset.json
llm_results = {}
for dataset_name, dataset in datasets.items():
    llm_results[dataset_name] = get_or_run_llm_eval(dataset_name, dataset, llm)
Loaded bge-m3_recursive_256_media evaluation results from data/data_new/gold/results/bge-m3_recursive_256_media_llm_eval_results.csv Loaded bge-m3_recursive_256_patent evaluation results from data/data_new/gold/results/bge-m3_recursive_256_patent_llm_eval_results.csv Loaded bge-m3_recursive_1024_media evaluation results from data/data_new/gold/results/bge-m3_recursive_1024_media_llm_eval_results.csv Loaded bge-m3_recursive_1024_patent evaluation results from data/data_new/gold/results/bge-m3_recursive_1024_patent_llm_eval_results.csv Loaded bge-m3_semantic_media evaluation results from data/data_new/gold/results/bge-m3_semantic_media_llm_eval_results.csv Loaded bge-m3_semantic_patent evaluation results from data/data_new/gold/results/bge-m3_semantic_patent_llm_eval_results.csv
The plot below compares the question categories across the evaluation metrics for the media dataset. Here, the Sustainability and Technological Innovation questions perform comparatively better than the others.
plot_multiple_evals_by_category({"selected_media": selected_media_llm_eval_results})
The plot below compares the question categories across the evaluation metrics for the patent dataset. Here it is hard to single out one category as the best performer; all of them perform similarly.
plot_multiple_evals_by_category({"selected_patent": selected_patent_llm_eval_results})
From the plot below we can conclude that the pipelines using semantic chunking achieve better results than the other chunking strategies.
plot_multiple_evals(llm_results)
mean_scores = {}
for name, results in llm_results.items():
    # Mean of every float metric column for this pipeline
    mean_scores[name] = results.select_dtypes(include=[np.float64]).mean()
# Average the per-metric means into one score per pipeline
total_mean_scores = pd.DataFrame(mean_scores).mean()
total_mean_scores.sort_values(ascending=False)
| Pipeline | Mean score |
|---|---|
| bge-m3_semantic_patent | 0.612933 |
| bge-m3_semantic_media | 0.603724 |
| hyde_patent | 0.581985 |
| bge-m3_recursive_256_media | 0.570399 |
| bge-m3_recursive_256_patent | 0.564499 |
| bge-m3_recursive_1024_patent | 0.549984 |
| hyde_media | 0.220992 |
| bge-m3_recursive_1024_media | NaN |
From the evaluation we can see that the RAG pipeline using the BGE-M3 embedding model together with semantic chunking has, on average across the metrics, the best performance (about 0.61 on the patent corpus and 0.60 on the media corpus). This is likely because BGE-M3 is a powerful embedding model and semantic chunking, with its variable chunk size, gives the LLM enough context but not so much that it gets distracted. Note that bge-m3_recursive_1024_media has no mean score (NaN) in the table and therefore cannot be ranked.
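The ranking above is a mean of per-metric means. The following is a minimal, pandas-free sketch of that aggregation on made-up toy scores; the pipeline names, metric values and the helper names `mean_skipna` and `rank_pipelines` are hypothetical and not part of the notebook. It also shows one way a NaN total can arise (every metric missing), which is an assumption for illustration and not a diagnosis of the NaN row in the table.

```python
import math
from statistics import fmean

def mean_skipna(values):
    """Mean that ignores NaN and returns NaN when nothing is left."""
    kept = [v for v in values if not math.isnan(v)]
    return fmean(kept) if kept else math.nan

def rank_pipelines(results):
    """results maps pipeline name -> {metric name -> per-question scores}."""
    totals = {
        name: mean_skipna([mean_skipna(scores) for scores in metrics.values()])
        for name, metrics in results.items()
    }
    # Highest mean first; pipelines without any valid score go last
    return sorted(totals.items(), key=lambda kv: (math.isnan(kv[1]), -kv[1]))

# Hypothetical toy scores, not taken from the evaluation runs
toy_results = {
    "pipeline_a": {"faithfulness": [0.8, 0.6], "answer_relevancy": [0.5, 0.7]},
    "pipeline_b": {"faithfulness": [0.4, 0.6], "answer_relevancy": [0.5, math.nan]},
    "pipeline_c": {"faithfulness": [math.nan], "answer_relevancy": [math.nan]},
}
ranking = rank_pipelines(toy_results)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Averaging means weights every metric equally, so a pipeline that is strong on one metric and weak on another can land in the middle of the ranking; the per-metric plots remain the better place to see such trade-offs.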
best_collection_media = collections["bge-m3_semantic_media"]
best_store_media = collection_to_store("bge-m3_semantic_media", embedding_models["bge-m3"])
best_collection_patent = collections["bge-m3_semantic_patent"]
best_store_patent = collection_to_store("bge-m3_semantic_patent", embedding_models["bge-m3"])
Advanced Methods¶
In this final section we will look at some more advanced methods to improve our RAG pipeline and compare them to our best-performing pipeline.
Multi-Querying¶
Multi-querying is a technique that queries the retrieval model with several questions instead of one in order to retrieve relevant chunks. The diversity of the queries lets the retriever capture a broader range of relevant information. By combining the results from multiple queries, we can potentially improve the quality of the retrieved chunks and, consequently, the generated responses. When creating these additional queries, the goal is to produce queries that differ from the original query but are still relevant to the user's information need, i.e., variations of the original query.
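As a rough illustration of the merging step, the sketch below runs several queries against a toy retriever and takes the order-preserving union of the results. Everything in it is hypothetical: the corpus `TOY_CHUNKS`, the word-overlap scorer `toy_retrieve` (a stand-in for vector search) and the hand-written query variations (a stand-in for LLM output).

```python
from typing import Callable, Dict, List

# Hypothetical toy corpus; chunk IDs and texts are made up for illustration
TOY_CHUNKS: Dict[str, str] = {
    "c1": "wildfire smoke exposure is rising in western states",
    "c2": "air quality sensors measure particulate pollution",
    "c3": "solar panels reduce household electricity bills",
    "c4": "forest fires worsen air quality across regions",
}

def toy_retrieve(query: str, k: int = 2) -> List[str]:
    """Rank chunks by the number of words shared with the query."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(text.split())), cid) for cid, text in TOY_CHUNKS.items()]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    # Keep the top k and drop chunks without any overlap
    return [cid for score, cid in ranked[:k] if score > 0]

def multi_query_retrieve(queries: List[str], retrieve: Callable[[str, int], List[str]], k: int = 2) -> List[str]:
    """Union of the results of all queries, keeping first-seen order."""
    seen, merged = set(), []
    for query in queries:
        for cid in retrieve(query, k):
            if cid not in seen:
                seen.add(cid)
                merged.append(cid)
    return merged

original = "wildfire smoke exposure"
variations = ["forest fires air quality", "particulate pollution sensors"]

single = toy_retrieve(original)
merged = multi_query_retrieve([original] + variations, toy_retrieve)
print(single, merged)
```

In this toy setup the original query only reaches one chunk, while the variations pull in two related ones. The same mechanism can also add unrelated chunks, so the union is not guaranteed to be more relevant than the single-query result.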

def generate_query_variations(query: str, num_additional_queries: int) -> List[str]:
    """Ask the LLM for alternative phrasings of `query`, one per line."""
    multiquery_prompt = """You are an assistant tasked with generating {num_queries} \
different versions of the given user question to retrieve relevant documents from a vector \
database. By generating multiple perspectives on the user question and breaking it down, your goal is to help \
the user overcome some of the limitations of the distance-based similarity search. \
Provide these alternative questions separated by newlines without any numbering or listing.
Original question: {question}
Alternatives:
"""
    multiquery_chain = ChatPromptTemplate.from_template(multiquery_prompt) | llm
    response = multiquery_chain.invoke({"question": query, "num_queries": num_additional_queries}).content
    # Drop blank lines so that empty strings are never sent to the retriever
    return [line.strip() for line in response.split("\n") if line.strip()]
def plot_multiquery_retrieval_results(query: str, collection: Collection, num_additional_queries: int = 3, num_results: int = 3):
    # Project all chunk embeddings into 2D with UMAP
    vectors = get_vectors_from_collection(collection)
    umap_transform = fit_umap(vectors)
    vectors_projections = project_embeddings(vectors, umap_transform)

    # Project the original query and its LLM-generated variations
    query_projections = project_embeddings(collection._embedding_function([query]), umap_transform)
    # Use the function argument instead of a hard-coded number of variations
    query_variations = generate_query_variations(query, num_additional_queries)
    query_variations_projections = project_embeddings(collection._embedding_function(query_variations), umap_transform)

    # Chunks retrieved for the original query
    original_relevant_docs = collection.query(
        query_texts=[query],
        n_results=num_results,
    )
    original_relevant_docs_ids = [item for sublist in original_relevant_docs["ids"] for item in sublist]  # flatten
    original_relevant_docs_embeddings = collection.get(include=["embeddings"], ids=original_relevant_docs_ids)["embeddings"]
    original_relevant_docs_projections = project_embeddings(original_relevant_docs_embeddings, umap_transform)

    # Chunks retrieved for the query variations
    additional_relevant_docs = collection.query(
        query_texts=query_variations,
        n_results=num_results,
    )
    additional_relevant_docs_ids = [item for sublist in additional_relevant_docs["ids"] for item in sublist]  # flatten
    # remove duplicates
    additional_relevant_docs_ids = list(set(additional_relevant_docs_ids))
    # remove the original relevant docs from the additional relevant docs
    additional_relevant_docs_ids = [doc_id for doc_id in additional_relevant_docs_ids if doc_id not in original_relevant_docs_ids]
    additional_relevant_docs_embeddings = collection.get(include=["embeddings"], ids=additional_relevant_docs_ids)["embeddings"]
    additional_relevant_docs_projections = project_embeddings(additional_relevant_docs_embeddings, umap_transform)

    fig = go.Figure()
    fig.add_trace(go.Scatter(x=vectors_projections[:, 0], y=vectors_projections[:, 1], mode='markers', marker=dict(size=5), name="other vectors"))
    fig.add_trace(go.Scatter(x=query_projections[:, 0], y=query_projections[:, 1], mode='markers', marker=dict(size=7, color='black', symbol='x'), name="original query"))
    fig.add_trace(go.Scatter(x=query_variations_projections[:, 0], y=query_variations_projections[:, 1], mode='markers', marker=dict(size=7, color='red', symbol='x'), name="query variations"))
    fig.add_trace(go.Scatter(x=original_relevant_docs_projections[:, 0], y=original_relevant_docs_projections[:, 1], mode='markers', marker=dict(size=7, color='orange'), name="original relevant docs"))
    fig.add_trace(go.Scatter(x=additional_relevant_docs_projections[:, 0], y=additional_relevant_docs_projections[:, 1], mode='markers', marker=dict(size=7, color='green'), name="additional relevant docs"))
    fig.show(renderer="colab")
plot_multiquery_retrieval_results("Climate Change", selected_collection_media)
plot_multiquery_retrieval_results("Climate Change", selected_collection_patent)
class MultiQueryRetriever(BaseRetriever):
    store: VectorStore
    num_additional_queries: int = 3
    num_results: int = 3

    def _get_query_variations(self, query: str) -> List[str]:
        return generate_query_variations(query, self.num_additional_queries)

    def _get_relevant_documents(
        self, original_query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        # Retrieve for every variation plus the original query
        queries = self._get_query_variations(original_query)
        queries.append(original_query)
        retriever = store_to_retriever(self.store, k=self.num_results)
        relevant_docs = []
        for query in queries:
            results = retriever.invoke(query)
            # remove duplicates
            for res in results:
                if res not in relevant_docs:
                    relevant_docs.append(res)
        return relevant_docs
multiquery_retriever = MultiQueryRetriever(store=best_store_media, num_additional_queries=3, num_results=3)
multiquery_chain = create_qa_chain(multiquery_retriever)
multiquery_chain.invoke("Where are the biggest increases in wildfire smoke exposure in recent years?")
{'context': [Document(metadata={'domain': 'cleantechnica', 'id': 554, 'title': 'Flooding Issue in NYC? Rooftop Gardens for Bus Stops to the Rescue!', 'url': 'cleantechnica.com/2023/06/15/flooding-issue-in-nyc-rooftop-gardens-for-bus-stops-to-the-rescue'}, page_content='Advertise with CleanTechnica to get your company in front of millions of monthly readers. In the midst of eye-popping summer precipitation in New England, the Deputy Commissioner at the New Hampshire Insurance Department, D.J. Bettencourt, encouraged New Hampshire... This summer the Northern Hemisphere has been so hot with record temperatures — including at sea — that discussions have turned to the limits... Copyright © 2023 CleanTechnica.'),
Document(metadata={'domain': 'energy-xprt', 'id': 1305, 'title': 'Fossil Power Plant ( Fossil Energy) Articles', 'url': 'energy-xprt.com/fossil-energy/fossil-power-plant/articles'}, page_content="So why aren't developers flocking... The American Lung Association's agenda for the new administration, Protect the Air We Breathe: An Agenda for Clean Air, states: `` Climate, energy and clean air are inexorably linked. Solutions that lead to cleaner air must be included in any approach to cleaner, more efficient energy use and reductions in global warming. '' 1 Wind energy is one such solution - a clean energy source that can... The State of Illinois has the largest non–renewable ( fossil) energy reserves among all the states in the USA. We report our studies on coal, oil and gas energy resources, conversions, consumptions and carbon dioxide sequestration advances in Illinois from the point of view of sustainability in energy and environmental. This includes reserves and characteristics of coal, oil and gas in the... Developed country governments have repeatedly committed to provide new and additional finance to help developing countries transition to low-carbon and climate-resilient growth. This assessment considers Japan’ s efforts to provide “ fast start finance ” ( FSF) between January 2010 and February 2012 in the context of the pledge by developed countries to mobilize USD 30 billion from... More Asia–Pacific countries need to embrace renewable energy and follow the first tentative steps of some governments, says Crispin Maslog. The South-East Asia and Pacific region is blessed with abundant sources of 'green ' energy — including sun, wind, water, biomass and geothermal — but governments are still not doing enough to harness them. The 30 countries in this part of... 
By SciDev.Net Carbon Capture and Storage ( CCS) consists of the capture of carbon dioxide ( CO2) from power plants and/or CO2-intensive industries such as refineries, cement, iron and steel, its subsequent transport to a storage site, and finally its injection into a suitable underground geological formation for the purposes of permanent storage. It is considered to be one of the medium term 'bridging... The global energy system is undergoing a transition from fossil fuels to renewable energy. There are clear signs that the pace of change is accelerating. 2009 was the second year in a row that more money was invested worldwide in renewable electricity generation projects than in fossil fuel-powered plants, according to data published by the United Nations. Developing countries, especially in... Climate Change and CCS In facing the challenge of mitigating global climate change, world leaders have acknowledged that no single solution exists, and therefore, a portfolio of carbon dioxide ( CO2) reduction technologies and methods will be needed to successfully confront rising emissions. Due to their dependency on fossil fuels, the energy supply and industrial sectors are the greatest..."),
Document(metadata={'domain': 'energyvoice', 'id': 1727, 'title': 'Cambo: A year on from Shell decision not to invest', 'url': 'energyvoice.com/oilandgas/north-sea/west-of-shetland/474463/cambo-a-year-on-shell-ithaca'}, page_content='“ There are a number of large-ish developments out there already, ” he says, pointing to areas including Perth, Buchan and Galapagos, “ and stepping into the firing line of Cambo is probably not high on anyone’ s priority list, I wouldn’ t have thought. ” Cambo, in the west of Shetland, is estimated to hold up to 800m barrels of oil in-place, with the first phase expected to recover 170m barrels. In the run up to COP26 in November 2021, it was one of the most high-profile and environmentally contested oilfields in the world, making frequent national headlines.'),
Document(metadata={'domain': 'cleantechnica', 'id': 508, 'title': 'World’ s Largest Floating Solar Array, Manchin Movement On Climate — Nexus News Roundup', 'url': 'cleantechnica.com/2022/01/06/worlds-largest-floating-solar-array-manchin-movement-on-climate-nexus-news-roundup'}, page_content='It’ s a good problem to have. It’ s a lot better than the problem of ‘ we have no resources at all.’ ” A raging urban firestorm fueled by hurricane-force winds incinerated nearly 1,000 homes and structures in and around Boulder, CO in just one day, making it the most destructive wildfire in the state’ s history. “ With CLIMATE CHANGE, there is no FIRE SEASON anymore, ” tweeted Mike Nelson, chief meteorologist for Denver7, the city’ s ABC affiliate. Climate change, primarily caused by the extraction and combustion of fossil fuels, supercharges fires like the Marshall Fire through increased temperatures and exacerbated drought. The blaze, which ignited Thursday, was effectively extinguished by snowfall by the next day. More than 30,000 people were forced to evacuate but just two people were missing as of Monday. How to give and get help ( Boulder Daily Camera), how climate change primed Colorado for a rare December wildfire ( CNBC), fires outside of Denver were the most destructive in Colorado history ( NPR), climate change-fueled blaze destroys 1,000 homes in Colorado in rare winter wildfire ( Democracy Now), climate scientists grapple with wildfire disaster in their backyard ( Axios), photos: wildfires engulf 1,000 homes in suburban Denver ( NPR). A syndicated newswire covering climate, energy, policy, art and culture.'),
Document(metadata={'domain': 'energy-xprt', 'id': 1648, 'title': 'Solar Regulations ( Solar Energy) Articles', 'url': 'energy-xprt.com/solar-energy/solar-regulations/articles'}, page_content='manufacturing—and the jobs that go with it—have been steadily increasing since 2010. As President Obama mentioned during last month’ s State of the Union address, the U.S. economy added 568,000 new manufacturing sector jobs between January 2010 and December 2013. Meanwhile, industry—of which manufacturing is the largest component—reduced its energy-related CO2... The Black Sea, a highly isolated inland sea, is the largest anoxic zone in the world. Since the hydrogen sulphide ( H2S) zone was discovered in the early 19th century in the Black Sea, it has been accepted that there is no life in the depths of the Black Sea and only bacteria live in the H2S layer. A high content of organic matter, with maximum processes of bacterial sulphate reduction, is the... This study provides an easy to follow description of the second law ( of thermodynamics) method as applied to a single effect absorption refrigeration cycle. Two different sets of working fluids: LiBr-H2O and NH3-H2O solution are considered, and a method to calculate the entropy of both working fluids is offered. The total exergy destruction in the system as a percentage of the exergy input from... This paper presents a statistical model that is able to predict carbon monoxide ( CO) concentrations as a function of meteorological conditions and various air quality parameters. The experimental work was conducted in an urban atmosphere, where the emissions from cars are prevalent. A mobile air pollution monitoring laboratory was used to collect data, which were divided into two groups: a... Need help finding the right suppliers?'),
Document(metadata={'domain': 'iea', 'id': 1893, 'title': 'Executive summary – Renewables 2023 – Analysis', 'url': 'iea.org/reports/renewables-2023/executive-summary'}, page_content='Modern renewable heat consumption expands by 40% globally during the outlook period, rising from 13% to 17% of total heat consumption. These developments come predominantly from the growing reliance on electricity for process heat – notably with the adoption of heat pumps in non‑energy‑intensive industries – and the deployment of electric heat pumps and boilers in buildings, increasingly powered by renewable electricity. China, the European Union and the United States lead these trends, owing to supportive policy environments; updated targets in the European Union and China; strong financial incentives in many markets; the adoption of renewable heat obligations; and fossil fuel bans in the buildings sector. However, the trends to 2028 are still largely insufficient to tackle the use of fossil fuels for heat and put the world on track to meet Paris Agreement goals. Without stronger policy action, the global heat sector alone between 2023 and 2028 could consume more than one‑fifth of the remaining carbon budget for a pathway aligned with limiting global warming to 1.5°C. Global renewable heat consumption would have to rise 2.2 times as quickly and be combined with wide-scale demand-side measures and much larger energy and material efficiency improvements to align with the NZE Scenario. Get updates on the IEA’ s latest news, analysis, data and events delivered twice monthly.'),
Document(metadata={'domain': 'cleantechnica', 'id': 3299, 'title': "As Lithium Supply Battle Heats Up, Stellantis Pays A $ 100 Million Visit To Hell's Kitchen", 'url': 'cleantechnica.com/2023/08/23/as-lithium-supply-battle-heats-up-stellantis-pays-a-100-million-visit-to-hells-kitchen'}, page_content='Image: Hell’ s Kitchen geothermal power plant and lithium supply facility with proposed EV battery operations, courtesy of CTR. Tina specializes in military and corporate sustainability, advanced technology, emerging materials, biofuels, and water and wastewater issues. Views expressed are her own. Follow her on Twitter @ TinaMCasey and Spoutible. Advertise with CleanTechnica to get your company in front of millions of monthly readers. The California Energy Commission just released energy data showing that solar power electricity production in California increased almost twenty times since 2012. The increase... SACRAMENTO — The California Air Resources Board today announced that it will transition its existing Clean Vehicle Rebate Project ( CVRP) program to a new... California’ s been getting a lot of bad press lately.'),
Document(metadata={'domain': 'cleantechnica', 'id': 691, 'title': 'Sarah Lozanova, Author at CleanTechnica', 'url': 'cleantechnica.com/author/sarahlozanova'}, page_content='If you look at the values of most... The setting for this discovery sounds like something out of a Dr. Seuss book. A fungus that grows in Ulmo trees in the Patagonian... The credit crunch is not just hurting the banks and the real estate market. Even the billionaire and wind energy enthusiast, T. Boone Pickens... By turning a long line of mirrors, the first solar thermal plant in nearly two decades was launched last week in Bakersfield, California. Unlike... Clean coal has been getting a lot of attention lately. Both Mr. McCain and Mr. Obama consider it to be an important piece in... Copyright © 2023 CleanTechnica.')],
'question': 'Where are the biggest increases in wildfire smoke exposure in recent years?',
'answer': "I don't know."}
multiquery_retriever = MultiQueryRetriever(store=best_store_patent, num_additional_queries=3, num_results=3)
multiquery_chain = create_qa_chain(multiquery_retriever)
multiquery_chain.invoke("Where are the biggest increases in wildfire smoke exposure in recent years?")
{'context': [Document(metadata={'id': 921, 'title': 'Control method of hydrogen energy combustion-supporting ultralow nitrogen combustor', 'topic': 4}, page_content='on the analysis result, the fuel quantity of various substances involved in the combustion process and the recovery quantity of high-temperature flue gas are adjusted, so that new combustible ultralow-nitrogen and hydrogen-rich mixed gas with a brand new proportion is realized, and the purposes of increasing the combustion temperature, ultralow-nitrogen emission, minimum smoke discharge and energy conservation and environmental protection are achieved.'),
Document(metadata={'id': 1721, 'title': 'Standby fire extinguishing system', 'topic': 9}, page_content='Standby fire extinguishing system: The utility model relates to the field of fire protection, in particular to a standby fire protection system which comprises a power supply module, an environment detection module, a main controller and a fire protection module; the main controller is respectively connected with the environment detection module, the fire control module and the power supply module; the environment detection module comprises a temperature detection device and a smoke detection device, wherein the temperature detection device is used for detecting temperature and outputting a temperature detection signal, and the smoke detection device is used for detecting smoke concentration and outputting a smoke detection signal; the main controller receives a temperature detection signal and a smoke detection signal, the temperature detection signal reaches a temperature preset value, and the smoke detection signal reaches a smoke preset value at the same time, and a water supply signal is output; the fire'),
Document(metadata={'id': 242, 'title': 'Power management system applied to forest fire response', 'topic': 0}, page_content='Power management system applied to forest fire response: The invention discloses a power management system applied to forest fire response, belonging to the technical field of power management and comprising a main body shell, wherein a hollow column is connected in a threaded manner at the middle position of the top side of the main body shell, a top groove is formed in the top side of the main body shell, a driving rod is rotatably connected to the top side of the hollow column, a driven rod is rotatably connected to the driving rod, a limit groove is formed in the top side of the driven rod, the driven rod is rotatably connected with the main body shell, the bottom side of the driven rod extends into the main body shell, a threaded rod is connected in a threaded manner at the top side of the driving rod, the bottom end of the threaded rod is matched with the limit groove, a limit plate is fixedly arranged at the position, close to the top, of the outer wall of the driving rod, an air inlet hole and an air'),
Document(metadata={'id': 1243, 'title': 'Spinel-corundum dual-phase high-entropy ceramic powder material and preparation method thereof', 'topic': 6}, page_content='(1700 ℃), and can be widely used as an infrared radiation material and a solar energy absorbing material in the fields of aerospace, power station boilers, photo-thermal power generation, interface evaporation, photo-thermal deicing and the like.'),
Document(metadata={'id': 1147, 'title': 'Mountain gorge area ancient village environmental protection processing system', 'topic': 5}, page_content='the rivers and gets into good oxygen pond, is connected with aeration equipment in the good oxygen pond, and aeration equipment adopts the solar energy power supply, and the play water after good oxygen pond is handled of sewage gets into constructed wetland, goes out the water up to standard after constructed wetland is handled and arrives in the sample well. The purpose of this patent is to solve current sewage treatment technique and lack the pertinence processing system to ancient village domestic sewage, the capital construction investment of sewage centralized treatment is big, power consumption is many, the working costs is high, the not good problem of treatment effect.'),
Document(metadata={'id': 1805, 'title': 'Utilize solar energy to realize source separation urine and excrement and urine resourceization's processing system', 'topic': 6}, page_content='collection chamber completely, and the printing opacity condensing plate covers in the evaporating chamber top and guides the comdenstion water to the collection chamber. According to the invention, after the source separated urine is deodorized and nitrogen-fixed by the acidified urine pool, the source separated urine enters the urine solar photo-thermal evaporation device, the urine is efficiently evaporated to dryness to reduce the volume of the urine, the urine is concentrated into high-concentration nitrogen fertilizer and phosphate fertilizer to realize recycling, and the steam contacts the condensing plate to become water drops to flow into the collecting chamber to realize water recycling; the excrement separated from the source is subjected to efficient solar photo-thermal evaporation to realize the drying of the excrement.'),
Document(metadata={'id': 477, 'title': 'Electric power peak regulation system based on microalgae respiration coupling solar energy and supercritical hydrothermal reaction', 'topic': 1}, page_content='Electric power peak regulation system based on microalgae respiration coupling solar energy and supercritical hydrothermal reaction: The invention discloses a power peak regulation system based on microalgae respiration coupling solar energy and supercritical hydrothermal reaction, relates to the field of carbon emission reduction, and comprises CO 2 A production system, a solar energy utilization system, a supercritical hydrothermal reaction system and a thermoelectric generation system. CO generated by supercritical hydrothermal reaction system 2 As working medium to generate electricity all day long, and the CO generated by the night respiration of the microalgae 2 Compressing, heating, and storing to obtain supercritical CO 2 The power generation replenishing working medium effectively relieves the peak of power consumption in the daytime; meanwhile, flue gas generated by the supercritical hydrothermal reactor can be used for heating O in the preheater 2 High temperature and high pressure CO'),
Document(metadata={'id': 984, 'title': 'Steam plasma burner device with in-cycle gasification of fuel', 'topic': 4}, page_content='nozzle, a plasma electrode is installed coaxially with it, electrically connected to a source of plasma-generating electric current and electrically isolated from the first Laval nozzle and the housing covering the linear chain of Laval nozzles, while the firing chamber is equipped with an air channel for supplying air to it and a plasma return channel from the firing chamber to the return channel of the nozzle.EFFECT: increase in the reliability of the burner device by eliminating the wear of the electrodes and increasing its efficiency by ensuring the maximum possible completeness of the combustion of the hydrocarbon component.1 cl, 1 dwg'),
Document(metadata={'id': 492, 'title': 'LNG light hydrocarbon separation coupling enhancement type geothermal flashing organic Rankine combined cycle power generation system', 'topic': 1}, page_content='LNG light hydrocarbon separation coupling enhancement type geothermal flashing organic Rankine combined cycle power generation system: An LNG light hydrocarbon separation coupling enhanced geothermal flashing organic Rankine combined cycle power generation system comprises an LNG light hydrocarbon separation system, a geothermal flashing circulation system, an organic Rankine circulation system and a natural gas direct expansion system; the LNG light hydrocarbon separation system is used for recovering C2+ light hydrocarbon resources; the geothermal flash evaporation circulating system, the organic Rankine circulating system and the natural gas direct expansion system are used for coupling LNG cold energy and medium and low temperature geothermal energy to generate electricity; the system performance is further enhanced by respectively arranging a precooler and a reheater in the organic Rankine cycle system and the geothermal flash evaporation cycle system; the invention realizes the effective recovery of C2+')],
'question': 'Where are the biggest increases in wildfire smoke exposure in recent years?',
'answer': "I don't know."}
def run_multiquery_strategy(eval_df, strategy_name):
    strategy_results = {}
    if "media" in strategy_name:
        key, store = "multiquery_media", best_store_media
    elif "patent" in strategy_name:
        key, store = "multiquery_patent", best_store_patent
    else:
        raise ValueError(f"Unknown strategy name: {strategy_name}")
    # Build the chain on the matching store. The global multiquery_chain was last
    # bound to the patent store and would otherwise also be used for the media run.
    chain = create_qa_chain(MultiQueryRetriever(store=store, num_additional_queries=3, num_results=3))
    # Step 1: Create or get the dataset
    datasets[key] = get_or_create_eval_dataset(key, eval_df, chain)
    # Step 2: Run or retrieve the evaluation results
    llm_results[key] = get_or_run_llm_eval(key, datasets[key], llm)
    # Step 3: Store results and plot
    strategy_results[strategy_name] = llm_results.get(strategy_name, {})
    strategy_results[key] = llm_results[key]
    plot_multiple_evals(strategy_results)
For media dataset¶
from datasets import Dataset
import json
from tqdm import tqdm

datasets_folder = gold_folder / "datasets"
datasets_folder.mkdir(exist_ok=True)

# Recursive cleaning for PyArrow compatibility
def clean_for_arrow(value):
    if isinstance(value, list):
        return [clean_for_arrow(v) for v in value]
    # None and float values (e.g. NaN coming from pandas) become empty strings
    if value is None or isinstance(value, float):
        return ""
    return str(value)

def convert_to_strings(datapoints):
    return {key: clean_for_arrow(val) for key, val in datapoints.items()}

def get_or_create_eval_dataset(name: str, df: pd.DataFrame, chain) -> Dataset:
    dataset_file = datasets_folder / f"{name}_dataset.json"
    if dataset_file.exists():
        # Reuse the cached dataset instead of calling the chain again
        with open(dataset_file, "r") as file:
            raw_data = json.load(file)
        cleaned_data = convert_to_strings(raw_data)
        dataset = Dataset.from_dict(cleaned_data)
        print(f"Loaded {name} dataset from {dataset_file}")
    else:
        datapoints = {
            "question": df["question"].tolist(),
            "answer": [],
            "contexts": [],
            "ground_truth": df["ground_truth"].tolist(),
            "context_urls": [],
            "category": df["category"].tolist() if "category" in df.columns else ["" for _ in df.index]
        }
        for question in tqdm(datapoints["question"], desc=f"Generating {name}"):
            result = chain.invoke(question)
            datapoints["answer"].append(result.get("answer", ""))
            datapoints["contexts"].append([doc.page_content for doc in result.get("context", [])])
            datapoints["context_urls"].append([doc.metadata.get("url", "") for doc in result.get("context", [])])
        # Clean data before writing to JSON and converting to Dataset
        cleaned_data = convert_to_strings(datapoints)
        with open(dataset_file, "w") as file:
            json.dump(cleaned_data, file)
        dataset = Dataset.from_dict(cleaned_data)
        print(f"Saved {name} dataset to {dataset_file}")
    return dataset
run_multiquery_strategy(eval_df_media.sample(100, random_state=42), "bge-m3_semantic_media")
Generating multiquery_media: 100%|██████████| 100/100 [07:12<00:00, 4.33s/it]
Saved multiquery_media dataset to data/data_new/gold/datasets/multiquery_media_dataset.json Loaded multiquery_media evaluation results from data/data_new/gold/results/multiquery_media_llm_eval_results.csv
We can see that, on average, answer correctness and faithfulness increase when using multi-querying. This is likely because the retrieval process is more robust and captures a broader range of relevant information. However, answer_relevancy and context_relevancy decrease, which could be due to multi-querying introducing more noise into the retrieval process: it retrieves more chunks in general, and some of them are less relevant.
For patent dataset¶
run_multiquery_strategy(eval_df_patent.sample(100, random_state=42), "bge-m3_semantic_patent")
Loaded multiquery_patent dataset from data/data_new/gold/datasets/multiquery_patent_dataset.json
Loaded multiquery_patent evaluation results from data/data_new/gold/results/multiquery_patent_llm_eval_results.csv
From the plot above, we can conclude that for the patent dataset the multi-query approach does not perform better than the original semantic embedding model, whereas for the media dataset it improves correctness and faithfulness.
HyDE - Hypothetical Document Embeddings¶
The idea of the HyDE method is to generate hypothetical documents for the user query and then retrieve the chunks that are most similar to these hypothetical documents. This can be useful when the user query is not very specific or is phrased very differently from the chunks, because the hypothetical documents tend to resemble the chunks more closely than the query does and can therefore improve retrieval. Another way to think about it: by generating a hypothetical answer, we reach an area of the embedding space that is close to the actual answer and that might not be reachable from the user query alone.
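A toy sketch of this intuition follows. It uses a bag-of-words vector as a stand-in for a real embedding model (an assumption made only to keep the example self-contained; the notebook uses dense embeddings), and the query, hypothetical passage and chunk are made-up strings.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a stand-in for a dense embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Made-up texts for illustration only
chunk = "wildfire smoke exposure increased most in the western united states"
query = "where did smoke get worse"
hypothetical = "wildfire smoke exposure increased sharply in western states in recent years"

sim_query = cosine(embed(query), embed(chunk))
sim_hyde = cosine(embed(hypothetical), embed(chunk))
print(f"query vs chunk: {sim_query:.2f}, hypothetical vs chunk: {sim_hyde:.2f}")
```

In this toy setup the hypothetical passage shares far more vocabulary with the chunk than the short query does, so it lands closer to the chunk. With a real embedding model the effect depends on how plausible the generated passage is.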

from typing import List

def generate_hypothetical_document(query: str, num_hypotheses: int) -> List[str]:
    # Prompt the LLM for a short passage about the query topic
    hyde_prompt = """Please write a news passage about the topic.
Topic: {query}
Passage:
"""
    hyde_chain = ChatPromptTemplate.from_template(hyde_prompt) | llm
    # Call the chain once per hypothesis to obtain several passages
    hypothetical_documents = [hyde_chain.invoke({"query": query}).content for _ in range(num_hypotheses)]
    return hypothetical_documents
def plot_hyde_retrieval_results(query: str, collection: Collection, num_hypo_documents: int = 2, num_results: int = 3):
    vectors = get_vectors_from_collection(collection)
    umap_transform = fit_umap(vectors)
    vectors_projections = project_embeddings(vectors, umap_transform)
    query_projections = project_embeddings(collection._embedding_function([query]), umap_transform)
    hypothetical_documents = generate_hypothetical_document(query, num_hypo_documents)
    query_variations_projections = project_embeddings(collection._embedding_function(hypothetical_documents), umap_transform)
    original_relevant_docs = collection.query(
        query_texts=[query],
        n_results=num_results,
    )
    original_relevant_docs_ids = [item for sublist in original_relevant_docs["ids"] for item in sublist]  # flatten
    original_relevant_docs_embeddings = collection.get(include=["embeddings"], ids=original_relevant_docs_ids)["embeddings"]
    original_relevant_docs_projections = project_embeddings(original_relevant_docs_embeddings, umap_transform)
    additional_relevant_docs = collection.query(
        query_texts=hypothetical_documents,
        n_results=num_results,
    )
    additional_relevant_docs_ids = [item for sublist in additional_relevant_docs["ids"] for item in sublist]  # flatten
    # remove duplicates
    additional_relevant_docs_ids = list(set(additional_relevant_docs_ids))
    # remove the original relevant docs from the additional relevant docs
    additional_relevant_docs_ids = [doc_id for doc_id in additional_relevant_docs_ids if doc_id not in original_relevant_docs_ids]
    additional_relevant_docs_embeddings = collection.get(include=["embeddings"], ids=additional_relevant_docs_ids)["embeddings"]
    additional_relevant_docs_projections = project_embeddings(additional_relevant_docs_embeddings, umap_transform)
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=vectors_projections[:, 0], y=vectors_projections[:, 1], mode='markers', marker=dict(size=5), name="other vectors"))
    fig.add_trace(go.Scatter(x=query_projections[:, 0], y=query_projections[:, 1], mode='markers', marker=dict(size=7, color='black', symbol='x'), name="original query"))
    fig.add_trace(go.Scatter(x=query_variations_projections[:, 0], y=query_variations_projections[:, 1], mode='markers', marker=dict(size=7, color='red', symbol='x'), name="hypothetical documents"))
    fig.add_trace(go.Scatter(x=original_relevant_docs_projections[:, 0], y=original_relevant_docs_projections[:, 1], mode='markers', marker=dict(size=7, color='orange'), name="original relevant docs"))
    fig.add_trace(go.Scatter(x=additional_relevant_docs_projections[:, 0], y=additional_relevant_docs_projections[:, 1], mode='markers', marker=dict(size=7, color='green'), name="additional relevant docs"))
    fig.show(renderer="colab")
plot_hyde_retrieval_results("Climate Change", selected_collection_media)
plot_hyde_retrieval_results("Climate Change", selected_collection_patent)
class HyDERetriever(BaseRetriever):
    store: VectorStore
    num_hypo_documents: int = 2  # hypothetical documents generated per query
    num_results: int = 3  # chunks retrieved per hypothetical document / query

    def _get_hypothetical_documents(self, query: str) -> List[str]:
        return generate_hypothetical_document(query, self.num_hypo_documents)

    def _get_relevant_documents(
        self, original_query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        hypothetical_documents = self._get_hypothetical_documents(original_query)
        # Also retrieve with the original query itself
        hypothetical_documents.append(original_query)
        retriever = store_to_retriever(self.store, k=self.num_results)
        relevant_docs = []
        for query in hypothetical_documents:
            results = retriever.invoke(query)
            # remove duplicates
            for res in results:
                if res not in relevant_docs:
                    relevant_docs.append(res)
        return relevant_docs
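The de-duplication loop above compares every new result against all documents kept so far. As a minimal sketch of an alternative, the merge can track a set of keys instead while still preserving retrieval order. The `Doc` class below is a hypothetical stand-in for the retrieved document type (not the LangChain class), and keying on the `id` metadata plus the chunk text is an assumption about what makes two chunks identical.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def merge_unique(result_lists):
    """Merge result lists in order, keeping the first copy of each chunk."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc in results:
            # Assumed identity of a chunk: its id metadata plus its text
            key = (doc.metadata.get("id"), doc.page_content)
            if key not in seen:
                seen.add(key)
                merged.append(doc)
    return merged


a = Doc("solar tower", {"id": 237})
b = Doc("heat pump", {"id": 259})
# Three result lists with overlapping chunks (toy data)
merged = merge_unique([[a, b], [Doc("solar tower", {"id": 237})], [b]])
print([d.page_content for d in merged])
```

For the handful of chunks retrieved per query the difference is negligible; the sketch mainly makes the notion of "duplicate" explicit.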
hyde_retriever_media = HyDERetriever(store=best_store_media, num_results=3)
hyde_chain_media = create_qa_chain(hyde_retriever_media)
hyde_chain_media.invoke("Where are the biggest increases in wildfire smoke exposure in recent years?")
{'context': [Document(metadata={'domain': 'energy-xprt', 'id': 1305, 'title': 'Fossil Power Plant ( Fossil Energy) Articles', 'url': 'energy-xprt.com/fossil-energy/fossil-power-plant/articles'}, page_content="So why aren't developers flocking... The American Lung Association's agenda for the new administration, Protect the Air We Breathe: An Agenda for Clean Air, states: `` Climate, energy and clean air are inexorably linked. Solutions that lead to cleaner air must be included in any approach to cleaner, more efficient energy use and reductions in global warming. '' 1 Wind energy is one such solution - a clean energy source that can... The State of Illinois has the largest non–renewable ( fossil) energy reserves among all the states in the USA. We report our studies on coal, oil and gas energy resources, conversions, consumptions and carbon dioxide sequestration advances in Illinois from the point of view of sustainability in energy and environmental. This includes reserves and characteristics of coal, oil and gas in the... Developed country governments have repeatedly committed to provide new and additional finance to help developing countries transition to low-carbon and climate-resilient growth. This assessment considers Japan’ s efforts to provide “ fast start finance ” ( FSF) between January 2010 and February 2012 in the context of the pledge by developed countries to mobilize USD 30 billion from... More Asia–Pacific countries need to embrace renewable energy and follow the first tentative steps of some governments, says Crispin Maslog. The South-East Asia and Pacific region is blessed with abundant sources of 'green ' energy — including sun, wind, water, biomass and geothermal — but governments are still not doing enough to harness them. The 30 countries in this part of... 
By SciDev.Net Carbon Capture and Storage ( CCS) consists of the capture of carbon dioxide ( CO2) from power plants and/or CO2-intensive industries such as refineries, cement, iron and steel, its subsequent transport to a storage site, and finally its injection into a suitable underground geological formation for the purposes of permanent storage. It is considered to be one of the medium term 'bridging... The global energy system is undergoing a transition from fossil fuels to renewable energy. There are clear signs that the pace of change is accelerating. 2009 was the second year in a row that more money was invested worldwide in renewable electricity generation projects than in fossil fuel-powered plants, according to data published by the United Nations. Developing countries, especially in... Climate Change and CCS In facing the challenge of mitigating global climate change, world leaders have acknowledged that no single solution exists, and therefore, a portfolio of carbon dioxide ( CO2) reduction technologies and methods will be needed to successfully confront rising emissions. Due to their dependency on fossil fuels, the energy supply and industrial sectors are the greatest..."),
Document(metadata={'domain': 'energy-xprt', 'id': 1685, 'title': 'Urban Energy ( Energy Management) News', 'url': 'energy-xprt.com/energy-management/urban-energy/news'}, page_content='Urban Energy ( Energy Management) News: By Theresa Duque Since early July, the Earth has sweltered under record-breaking heat. In the United States, from California and the Desert Southwest to Texas and Florida, a long-lasting heat wave in the triple digits has broken dozens of heat records – and counting. To mitigate the risks of living in extreme heat, scientists at the Department of Energy’ s Lawrence Berkeley National Laboratory...'),
Document(metadata={'domain': 'iea', 'id': 1893, 'title': 'Executive summary – Renewables 2023 – Analysis', 'url': 'iea.org/reports/renewables-2023/executive-summary'}, page_content='Modern renewable heat consumption expands by 40% globally during the outlook period, rising from 13% to 17% of total heat consumption. These developments come predominantly from the growing reliance on electricity for process heat – notably with the adoption of heat pumps in non‑energy‑intensive industries – and the deployment of electric heat pumps and boilers in buildings, increasingly powered by renewable electricity. China, the European Union and the United States lead these trends, owing to supportive policy environments; updated targets in the European Union and China; strong financial incentives in many markets; the adoption of renewable heat obligations; and fossil fuel bans in the buildings sector. However, the trends to 2028 are still largely insufficient to tackle the use of fossil fuels for heat and put the world on track to meet Paris Agreement goals. Without stronger policy action, the global heat sector alone between 2023 and 2028 could consume more than one‑fifth of the remaining carbon budget for a pathway aligned with limiting global warming to 1.5°C. Global renewable heat consumption would have to rise 2.2 times as quickly and be combined with wide-scale demand-side measures and much larger energy and material efficiency improvements to align with the NZE Scenario. Get updates on the IEA’ s latest news, analysis, data and events delivered twice monthly.'),
Document(metadata={'domain': 'azocleantech', 'id': 237, 'title': 'Widespread Tree Scorch in the Pacific Northwest Mainly Attributed to Heat than to Drought Conditions', 'url': 'azocleantech.com/news.aspx?newsID=32920'}, page_content="Widespread Tree Scorch in the Pacific Northwest Mainly Attributed to Heat than to Drought Conditions: Widespread tree scorch in the Pacific Northwest that became visible shortly after multiple days of record-setting, triple-digit temperatures in June 2021 was more attributable to heat than to drought conditions, Oregon State University researchers say. In a paper published in Tree Physiology, a team led by Christopher Still of the OSU College of Forestry cites evidence that leaf discoloration and damage are consistent with direct exposure to solar radiation during the hottest afternoons of the `` heat dome '' that covered northwestern North America. Still and other scientists from OSU were responding to an article published in the same journal in April 2022 that concluded the trees ' problems were the result of drought and a failure in the trees ' hydraulic system, which helps foliage stay cool through the exhalation of water vapor via a process known as transpiration. The collaboration that produced the response following a literature review includes researchers from Oregon State's colleges of Engineering, Agricultural Sciences, and Earth, Ocean, and Atmospheric Sciences, as well as two other OSU-affiliated organizations, the Oregon Climate Change Research Institute and the PRISM Climate Group. `` While we think the drought/hydraulic hypothesis is partly true, we argue that multiple lines of evidence suggest the main issue was in fact direct heat damage, '' said Still, a tree physiologist who studies forests in the context of climate change impacts and feedbacks. 
`` Tree physiologists have worked a lot to show that hydraulic damage in response to drought drives a lot of tree mortality, and the paper we comment on more or less fits in that vein, implying that what we saw in June 2021 was just another example of drought damage and that the heat dome was a sort of extreme drought event. '' Still and OSU colleagues including ecologist and plant pathologist Posy Busby, H.J. Andrews Experimental Forest Director Mark Schulze, forest health specialist David Shaw, hydrologist David Rupp and geospatial climatologist Chris Daly say that damage can be driven by extreme heat alone, irrespective of prior hydrologic context and water availability. They note that the heat dome was one of the most extreme heat waves ever recorded anywhere in the world and the most intense ever in the Northwest. The scientists also point out that there is `` a clear distinction in the climate and hydrometeorological literature between droughts and heat waves '' and that `` heat waves are not just associated with droughts, as is commonly assumed, but are increasing in frequency during both wet and dry conditions. '' Among coastal Douglas-fir and western hemlock plantation forests in western Oregon and Washington, the most extensive impacts of the heat dome were in areas experiencing comparatively low levels of drought, the authors say. Conversely, many forests around Oregon's Willamette Valley and along the western slopes of the Cascade Range that were experiencing severe to exceptional drought during the heat dome showed less foliar damage. `` It's also important to remember that conifer needles can discolor for many reasons besides being dried out, '' Still said. Much of the observed `` foliar scorch '' resembled what is caused by heat generated from fires, Still said, and also followed patterns that suggest heat was the primary driver of foliar damage during the heat dome. 
Trees on south- and west-facing slopes and on exposed edges near roadsides generally showed the greatest scorch, and opposite sides of the same trees, or other trees on the same hillsides, displayed little to none. `` The scorching that did occur happened fast, within days and sometimes hours, much faster than would typically be associated with a malfunction of the trees ' water moving capabilities, '' Still said. `` And the prevalence of scorching in sunlit foliage also challenges the hypothesis that drought and hydraulic failure combined to be the primary cause of leaf damage. '' `` Our prior work has shown drought-induced foliar browning in conifers can take weeks or even months to appear after lethal levels of drought stress, '' added co-author William Hammond, an assistant professor of plant ecophysiology at the University of Florida. The scientists emphasize that they are not saying hydraulics played no role in the leaf damage, or in the subsequent death of some trees, but that extreme heat is the best explanation for the crown- and landscape-scale scorch patterns seen throughout the Pacific Northwest during and after the heat dome. `` Disentangling drought from heat damage is tricky, and we argue the research community needs to work much more on heat stress physiology, '' Still said. `` We need to explore connections between hydraulic properties and heat tolerance – safety margins, how evolution may have helped some species with heat tolerance, canopies ' ability to maintain leaf temperatures below damaging thresholds. What happened during the heat dome argues for a renewed emphasis on understanding the underlying physiological and biophysical mechanisms that can lead to heat resilience. '' College of Forestry research associate Adam Sibley is a co-author of the commentary, as are scientists from the U.S."),
Document(metadata={'domain': 'cleantechnica', 'id': 508, 'title': 'World’ s Largest Floating Solar Array, Manchin Movement On Climate — Nexus News Roundup', 'url': 'cleantechnica.com/2022/01/06/worlds-largest-floating-solar-array-manchin-movement-on-climate-nexus-news-roundup'}, page_content='It’ s a good problem to have. It’ s a lot better than the problem of ‘ we have no resources at all.’ ” A raging urban firestorm fueled by hurricane-force winds incinerated nearly 1,000 homes and structures in and around Boulder, CO in just one day, making it the most destructive wildfire in the state’ s history. “ With CLIMATE CHANGE, there is no FIRE SEASON anymore, ” tweeted Mike Nelson, chief meteorologist for Denver7, the city’ s ABC affiliate. Climate change, primarily caused by the extraction and combustion of fossil fuels, supercharges fires like the Marshall Fire through increased temperatures and exacerbated drought. The blaze, which ignited Thursday, was effectively extinguished by snowfall by the next day. More than 30,000 people were forced to evacuate but just two people were missing as of Monday. How to give and get help ( Boulder Daily Camera), how climate change primed Colorado for a rare December wildfire ( CNBC), fires outside of Denver were the most destructive in Colorado history ( NPR), climate change-fueled blaze destroys 1,000 homes in Colorado in rare winter wildfire ( Democracy Now), climate scientists grapple with wildfire disaster in their backyard ( Axios), photos: wildfires engulf 1,000 homes in suburban Denver ( NPR). A syndicated newswire covering climate, energy, policy, art and culture.'),
Document(metadata={'domain': 'energy-xprt', 'id': 1648, 'title': 'Solar Regulations ( Solar Energy) Articles', 'url': 'energy-xprt.com/solar-energy/solar-regulations/articles'}, page_content='manufacturing—and the jobs that go with it—have been steadily increasing since 2010. As President Obama mentioned during last month’ s State of the Union address, the U.S. economy added 568,000 new manufacturing sector jobs between January 2010 and December 2013. Meanwhile, industry—of which manufacturing is the largest component—reduced its energy-related CO2... The Black Sea, a highly isolated inland sea, is the largest anoxic zone in the world. Since the hydrogen sulphide ( H2S) zone was discovered in the early 19th century in the Black Sea, it has been accepted that there is no life in the depths of the Black Sea and only bacteria live in the H2S layer. A high content of organic matter, with maximum processes of bacterial sulphate reduction, is the... This study provides an easy to follow description of the second law ( of thermodynamics) method as applied to a single effect absorption refrigeration cycle. Two different sets of working fluids: LiBr-H2O and NH3-H2O solution are considered, and a method to calculate the entropy of both working fluids is offered. The total exergy destruction in the system as a percentage of the exergy input from... This paper presents a statistical model that is able to predict carbon monoxide ( CO) concentrations as a function of meteorological conditions and various air quality parameters. The experimental work was conducted in an urban atmosphere, where the emissions from cars are prevalent. A mobile air pollution monitoring laboratory was used to collect data, which were divided into two groups: a... Need help finding the right suppliers?')],
'question': 'Where are the biggest increases in wildfire smoke exposure in recent years?',
'answer': "I don't know."}
hyde_retriever_patent = HyDERetriever(store=best_store_patent, num_results=3)
hyde_chain_patent = create_qa_chain(hyde_retriever_patent)
hyde_chain_patent.invoke("Where are the biggest increases in wildfire smoke exposure in recent years?")
{'context': [Document(metadata={'id': 238, 'title': 'Method for cleaning the air basins of cities from smog and pollutants in the surface layer', 'topic': 1}, page_content='Due to the difference in air temperature near the water surface and the heated layers in the tower, an upward air flow is created. They use a tower installed with the help of support platforms at the base on two barges separated from each other, oriented against the flow of the river in the longitudinal direction, and a tower in the form of a cone with a black coating of its fence. EFFECT: method provides for the creation of natural convection of air masses by thermal heating of air in the tower due to solar energy and obtaining an upward air flow that promotes the appearance of circulation movement near the tower, the creation of thrust, removal and dispersion of harmful impurities from the surface zone into the upper atmosphere. 2 cl, 2 dwg'),
Document(metadata={'id': 580, 'title': 'Offshore power generation device based on wind energy and tidal current energy', 'topic': 3}, page_content='The energy conversion efficiency in the sea area is greatly improved.'),
Document(metadata={'id': 237, 'title': 'Method for cleaning the air basins of cities from smog and pollutants in the surface layer', 'topic': 1}, page_content='Method for cleaning the air basins of cities from smog and pollutants in the surface layer: FIELD: meteorology. SUBSTANCE: invention relates to the field of applied meteorology and ecology. The method consists in creating a natural convection of air masses by thermal heating of the air in the tower to obtain an upward air flow that promotes the appearance of circulation movement near the tower, the creation of thrust, removal and dispersion of harmful impurities from the surface zone into the upper atmosphere. Solar energy is used as a source of heat to heat the air in the tower. As a cold source, that is the surface of the water in the river.'),
Document(metadata={'id': 1138, 'title': 'Pressure control-based heat pipe type solar monitoring and early warning system', 'topic': 8}, page_content='Pressure control-based heat pipe type solar monitoring and early warning system: The invention relates to the field of solar energy monitoring, which is used for solving the problems that the solar energy lacks an effective monitoring system, a pressure relief fault cannot be found timely, and potential safety hazards exist, in particular to a heat pipe type solar energy monitoring and early warning system based on pressure control; according to the invention, when solar energy is monitored, through comprehensively collecting temperature data and pressure data, collecting the data in a fusion way and processing the collected data in a formula quantitative analysis way, not only can the pressure temperature of the solar energy be monitored in real time, but also the future pressure temperature condition can be scientifically predicted, so that the stability of the solar energy in the use process is improved, and through actively monitoring the temperature and the pressure of the solar energy, not only can sound and light alarm be generated to remind a user, but also the pressure relief device can be actively opened when the solar energy is abnormal in operation, thereby avoiding the damage of the solar energy and improving the intelligent degree of the solar energy.'),
Document(metadata={'id': 501, 'title': 'Air conditioner control method and device', 'topic': 2}, page_content='Air conditioner control method and device: The invention provides an air conditioner control method and device, belonging to the technical field of air conditioners, wherein the method comprises the following steps: acquiring the current environment state of an area where an air conditioner is located; acquiring the total power consumption of the air conditioner according to the current environment state and the target environment state; the total power consumption is the total power consumed by adjusting the current environment state of the area where the air conditioner is located to a target environment state; and determining a power supply mode for the air conditioner according to the total power consumption and the available power of the wind-solar hybrid system. According to the air conditioner control method and device, the total power consumption of the air conditioner is predicted according to the current environment state of the area where the air conditioner is located and the target environment state set by a user; furthermore, the power supply mode of the air conditioner is determined according to the total power consumption and the available power of the wind-solar hybrid system, so that the electric energy of the wind-solar energy storage system can be reasonably and effectively utilized, and the energy is saved.'),
Document(metadata={'id': 353, 'title': 'New energy vehicle air conditioning system and energy-saving control method', 'topic': 1}, page_content='New energy vehicle air conditioning system and energy-saving control method: The invention discloses a new energy vehicle air conditioning system, which comprises an A three-way valve, wherein the A three-way valve is provided with two air inlet interfaces and an air outlet interface, the two air inlet interfaces are respectively connected with an air inlet pipe communicated with the outside and an internal circulation air inlet pipe communicated with the inside of an automobile, the air outlet interface of the A three-way valve is connected with a filter through an air pipe, the filter is connected with a heating adjusting device through an air pipe, the heating adjusting device is connected with an air inlet pump through an air pipe, the air inlet pump is connected with an air outlet of an air conditioner of the automobile through an air pipe, and the heating adjusting device is used for electrically heating and/or heating air by solar energy. The battery is in a proper temperature environment, and the problem of capacity reduction of the lithium battery caused by low temperature is avoided.'),
Document(metadata={'id': 259, 'title': 'Energy supply system for coupling thermochemical heat accumulation and air circulation heat pump air conditioner', 'topic': 1}, page_content='Energy supply system for coupling thermochemical heat accumulation and air circulation heat pump air conditioner: The invention belongs to the field of green buildings and renewable energy source utilization, and provides an energy supply system for coupling thermochemical heat accumulation with an air circulation heat pump air conditioner, which comprises an air circulation heat pump air conditioner subsystem, a thermochemical heat accumulation subsystem and a heat supply subsystem, wherein the temperature of solar energy and low-grade energy is increased through an air circulation heat pump, and obtained high-temperature air is sent to the thermochemical heat accumulation subsystem and is used for the air conditioner by virtue of low-temperature air generated by an expansion machine; when heat is needed, the thermochemical heat storage subsystem releases heat to supply heat to the indoor space and provide domestic hot water. According to the invention, the air circulation heat pump is combined with thermochemical heat storage, so that the problems that heat storage is difficult to realize in a cross-season mode and low grade energy is difficult to recycle in a thermochemical heat storage mode due to large heat loss in the traditional heat storage mode are solved; meanwhile, the electric turbocharger is used, so that the problem that the expander is difficult to miniaturize is avoided; the system can supply cold, heat and heat to the building, improves the energy utilization rate, and is beneficial to low-carbon operation of the building.'),
Document(metadata={'id': 1095, 'title': 'Novel cross shutter type respiratory curtain wall system and control method thereof', 'topic': 7}, page_content='Novel cross shutter type respiratory curtain wall system and control method thereof: The invention discloses a novel cross-shaped louver type respiratory curtain wall system and a control method thereof, the system comprises a double-layer glass curtain wall structure, a fire-fighting louver and a fire-fighting linkage control unit, the double-layer glass curtain wall comprises an opening window fixed on the outer wall surface of a building main body, an outer layer glass curtain wall fixed on the building main body, and a ventilation channel arranged between the opening window and the outer layer glass curtain wall, the outer layer glass curtain wall is a cross-shaped photovoltaic louver, the cross-shaped photovoltaic louver comprises a light-transmitting blade, an axle center and a photovoltaic blade, the light-transmitting blade and the photovoltaic blade are fixed on the axle center, the light-transmitting blade and the photovoltaic blade are crossed in a cross shape, and the crossing point is the axle center. In four seasons, the cross photovoltaic shutter converts solar energy into electric energy, and energy storage is realized through the energy storage device arranged in the machine room in the building, so that the heat preservation and heat insulation functions, the sun shading functions, the ventilation functions, the smoke prevention functions and the photovoltaic power generation functions are integrated.')],
'question': 'Where are the biggest increases in wildfire smoke exposure in recent years?',
'answer': "I don't know."}
def run_hyde_strategy(eval_df, strategy_name):
    # Initialize a dictionary to store the results
    strategy_results = {}
    # Step 1: Pick the dataset key and the matching HyDE chain
    if "media" in strategy_name:
        key, chain = "hyde_media", hyde_chain_media
    elif "patent" in strategy_name:
        key, chain = "hyde_patent", hyde_chain_patent
    else:
        raise ValueError(f"Expected 'media' or 'patent' in strategy name, got: {strategy_name}")
    # Create or get the dataset
    datasets[key] = get_or_create_eval_dataset(key, eval_df, chain)
    # Step 2: Run or retrieve the evaluation results
    llm_results[key] = get_or_run_llm_eval(key, datasets[key], llm)
    # Step 3: Store the baseline and HyDE results side by side
    strategy_results[strategy_name] = llm_results.get(strategy_name, {})
    strategy_results[key] = llm_results[key]
    # Step 4: Plot the results (for visualization)
    plot_multiple_evals(strategy_results)
For media dataset¶
run_hyde_strategy(eval_df=eval_df_media.sample(100, random_state=42), strategy_name="bge-m3_semantic_media")
Generating hyde_media: 100%|██████████| 100/100 [01:47<00:00, 1.08s/it]
Saved hyde_media dataset to data/data_new/gold/datasets/hyde_media_dataset.json
Evaluating: 0%| | 0/400 [00:00<?, ?it/s]
Saved hyde_media evaluation results to data/data_new/gold/results/hyde_media_llm_eval_results.csv
On average, context relevancy is the only metric that increases with the HyDE approach. For the media dataset, the original bge-m3 semantic model performs considerably better than HyDE on correctness, faithfulness and answer relevancy.
For patent dataset¶
run_hyde_strategy(eval_df=eval_df_patent.sample(100, random_state=42), strategy_name="bge-m3_semantic_patent")
Generating hyde_patent: 100%|██████████| 100/100 [02:38<00:00, 1.59s/it]
Saved hyde_patent dataset to data/data_new/gold/datasets/hyde_patent_dataset.json
Evaluating: 0%| | 0/400 [00:00<?, ?it/s]
WARNING:ragas.llms.output_parser:Failed to parse output. Returning None. WARNING:ragas.llms.output_parser:Failed to parse output. Returning None.
Saved hyde_patent evaluation results to data/data_new/gold/results/hyde_patent_llm_eval_results.csv
There is a slight increase in the correctness metric. All other metrics are comparable to the baseline and show no improvement.
Other Methods¶
There are many other methods that can be used to improve the RAG pipeline. Some of these include:
- Step Back: The idea is to take a step back and understand the concepts and context behind the user query, and then use this information to retrieve the most relevant chunks.
- Hybrid Search: The idea is to use not only semantic search but also lexical search to retrieve the most relevant chunks, and to combine the results with a re-ranking step.
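As an illustration of the hybrid search idea, the sketch below combines a semantic ranking and a lexical ranking with reciprocal rank fusion (RRF). RRF is swapped in here as one simple way to merge two ranked lists; it is not the notebook's implementation, and a dedicated re-ranking model could be used instead. The function name, the smoothing constant `k=60` and the chunk ids are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into a single ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. `k` is a smoothing constant (assumed value).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism
    return sorted(scores, key=lambda c: (-scores[c], c))


# Toy rankings from a semantic and a lexical retriever
semantic = ["c3", "c1", "c7"]
lexical = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([semantic, lexical])
print(fused)
```

Chunks that appear in both lists (`c1`, `c3`) rise to the top, while chunks found by only one retriever are kept but ranked lower.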
os.system("jupyter nbconvert --to html --template pj cleantech_rag.ipynb")
65280
!jupyter nbconvert --to html --execute --embed-images "/content/cleantech_rag_updated2 (6).ipynb"
[NbConvertApp] Converting notebook /content/cleantech_rag_updated2 (6).ipynb to html
0.00s - Debugger warning: It seems that frozen modules are being used, which may
0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off
0.00s - to python to disable frozen modules.
0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
from pathlib import Path
import shutil
from google.colab import files
# Path to the folder you want to zip
data_folder = Path("/content/data")
# Create zip file (this will be /content/data.zip)
zip_path = shutil.make_archive(str(data_folder), 'zip', root_dir=str(data_folder))
# Download the zip file
files.download(zip_path)